# Architecture

The Datalabs API is used for interacting with data in our datalake. The overall design of Datalabs follows two main paradigms:

* A blob storage for raw data.
* A key-value store for metadata describing that data.

Blob storage holds raw data uploaded by users in varying formats (CSV only as of now), and as such does not require a specific schema. Metadata, on the other hand, describes the uploaded raw data, and as such must conform to a standard schema. In the next section, we introduce our selected metadata schema and the reasons why we chose it.
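As a minimal sketch of this split (all names here are illustrative, not the actual Datalabs implementation), an upload could be routed to its two destinations like this:

```javascript
// Illustrative sketch of the two-store design; function and field names
// are hypothetical, not the real Datalabs API.
function splitUpload(upload) {
  // Raw bytes go to blob storage under an opaque key -- no schema imposed.
  const blob = {
    key: `${upload.datasetName}/${upload.fileId}.csv`,
    bytes: upload.bytes,
  };
  // Descriptive metadata goes to the key-value store and must follow
  // the standard metadata schema.
  const metadata = {
    name: upload.datasetName,
    format: 'csv',
    mediatype: 'text/csv',
  };
  return { blob, metadata };
}

const { blob, metadata } = splitUpload({
  datasetName: 'daily-load',
  fileId: 'abc123',
  bytes: Buffer.from('date,total\n2012-01-01,42\n'),
});
console.log(blob.key);        // daily-load/abc123.csv
console.log(metadata.format); // csv
```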

## Schema

The schema for data metadata follows the [Frictionless Data Specifications](https://specs.frictionlessdata.io/#overview) (FDS). FDS is an open set of patterns for describing data, developed by [Frictionless Data](https://frictionlessdata.io/) in conjunction with the [Open Knowledge Foundation](https://okfn.org/opendata/).

At the core of FDS is a set of patterns for describing data including:

* [Data Package](https://specs.frictionlessdata.io/data-package/#language): A simple container format used to describe and package a collection of datasets, or in Frictionless terms, resources.
* [Data Resource](https://specs.frictionlessdata.io/data-resource/#language): A single spec describing an individual data file.
* [Table Schema](https://specs.frictionlessdata.io/table-schema/#language): A simple format to declare a schema for tabular data. The schema is designed to be expressible in JSON.
* [Data Views](https://specs.frictionlessdata.io/views): A simple format for describing views on data that leverages existing specifications like Vega/Plotly and connects them with data provided in data packages or data resources.
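For illustration only (the field names below are assumed, not taken from a real Datalabs descriptor), a Table Schema for a two-column daily-load CSV might look like:

```javascript
// Hypothetical Table Schema descriptor for a two-column daily-load CSV.
// The structure follows the Table Schema spec; the field names are assumed.
const tableSchema = {
  fields: [
    { name: 'date', type: 'date', format: 'default' },
    { name: 'total_load', type: 'number' },
  ],
  primaryKey: 'date',
  missingValues: [''],
};

// Being plain JSON, the schema can travel inside a Data Resource descriptor.
console.log(JSON.stringify(tableSchema, null, 2));
```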

### Why use FDS?

We decided to use FDS for the following reasons:

* It is simple.
* It has a great tooling ecosystem, such as [frictionless.js](https://github.com/frictionlessdata/frictionless-js).
* It can be easily extended and customized.
* It is built on existing standards and formats.
* It can be used across a broad range of technologies.

### High-Level Data Flow

This section graphically depicts the flow of data from users into our datalake and the processes involved in generating schema.

![Datalab data flow](https://1974413172-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MW3xxRi0Vf9N8eqeKB5%2F-MW87wYU4Q8OUNsjOTjv%2F-MW8X7SVOuiE-n1Fe_YG%2FScreen%20Shot%202021-03-19%20at%209.09.13%20AM.png?alt=media\&token=2059458d-c05f-482a-acf3-061887e39a59)

### Datalabs Metadata schema

In this section we present a sample schema generated when files are uploaded via a browser or from a URL. Under the hood we use the [frictionless.js](https://github.com/frictionlessdata/frictionless-js) library to read such files, and a standard FDS schema is auto-generated.

This auto-generated schema contains the major spec fields, but it has to be extended to include some of our custom fields.
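A minimal sketch of that extension step (the helper and sample values are assumptions for illustration, not the actual Datalabs code):

```javascript
// Hypothetical sketch: merge the auto-generated FDS descriptor with
// Datalabs-specific custom fields. Names and values are illustrative.
function extendDescriptor(autoGenerated, custom) {
  // Spread order lets custom fields win on any key collision.
  return { ...autoGenerated, ...custom };
}

const autoGenerated = {
  name: 'daily-loading-profile',
  resources: [{ format: 'csv', encoding: 'utf-8' }],
};

const extended = extendDescriptor(autoGenerated, {
  access: 'public',
  attribution: 'City of Naperville, Illinois',
  createdAt: '2021-03-13',
});

console.log(Object.keys(extended));
// [ 'name', 'resources', 'access', 'attribution', 'createdAt' ]
```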

A final schema with data description looks like the following:

```json
{
    "resources": [
        {
            "path": "https://storage.googleapis.com/social-safety-datalake-prod/Daily Loading Profile 2012-Present-1615671473755/4f684370-e561-46e0-80a3-5c65aa2f617c.csv",
            "pathType": "remote",
            "name": "4f684370-e561-46e0-80a3-5c65aa2f617c",
            "format": "csv",
            "mediatype": "text/csv",
            "encoding": "utf-8",
            "hash": "9d88ea4d873740c74e522e2c9755e8be10be9d703fd901e269983379b9dbdde9"
        }
    ],
    "name": "Daily Loading Profile 2012-Present",
    "title": "Daily Loading Profile 2012-Present",
    "description": "Citywide total daily electric load (usage) measured hourly, presented here as a daily total beginning on January 1, 2012 through last updated date.",
    "access": "public",
    "attribution": "City of Naperville, Illinois",
    "attribution_link": "http://www.naperville.il.us/",
    "version": 0.1,
    "author": "603ce141b458952168bc532e",
    "createdAt": "2021-03-13",
    "updatedAt": "2021-03-13",
    "updatedLastBy": "603ce141b458952168bc532e"
}
```

> See the [schema definition page](https://tech.socialsafety.net/datalabs/schema-dictionary) for more details.

### DataLabs API

The Datalabs API is published [here](https://documenter.getpostman.com/view/13777544/TVzaAZRr).

### Technology Stack

We are using the following tech stack for DataLabs API:

* **Programming Language**: JavaScript
* **Frameworks**: Node.js and Express
* **Object Storage**: Elasticsearch
* **Blob Storage**: Google Cloud Storage buckets
* **In-memory Cache**: Redis

### Devops and Hosting

#### **Blob Storage**

There are two active GCS buckets for raw data storage:

* Production blob bucket under the name [social-safety-datalake-prod](https://console.cloud.google.com/storage/browser/social-safety-datalake-prod;tab=objects?forceOnBucketsSortingFiltering=false\&project=social-safety), and
* Staging blob bucket under the name [social-safety-datalake-staging](https://console.cloud.google.com/storage/browser/social-safety-datalake-staging;tab=objects?forceOnBucketsSortingFiltering=false\&project=social-safety)

#### **Object Storage**

A custom Elasticsearch database is hosted on a GCP VM under the name [ssn-elasticsearch-prod](https://console.cloud.google.com/compute/instancesDetail/zones/us-central1-a/instances/ssn-elasticsearch-prod?project=social-safety).

#### DataLabs API hosting

The Datalabs API is hosted on Heroku under the pipeline name [ssn-datalabs](https://dashboard.heroku.com/pipelines/aee73da6-d75d-42e0-b3fc-a43922c1f79b). We have two hosted instances, one for the staging environment and one for production.

* **Staging**: The staging environment is hosted under the name [ssn-datalake-staging](https://dashboard.heroku.com/apps/ssn-datalake-staging). Heroku CI/CD is configured to automatically run tests and deploy a new version on every push to the [dev](https://github.com/PhilanthroLab/SSN-DataLake/tree/dev) branch.
* **Production**: The production environment is hosted under the name [ssn-datalake-production](https://dashboard.heroku.com/apps/ssn-datalake-production). Heroku CI/CD is configured to automatically run tests and deploy a new version on every push to the [main](https://github.com/PhilanthroLab/SSN-DataLake/tree/main) branch.

### Development environment setup

To set up a local development environment, follow the steps below:

* Ensure you have Node.js (v12 or above), Yarn, and Git installed.
* Clone the repository:

```bash
git clone https://github.com/PhilanthroLab/SSN-DataLake.git
```

* `cd` into the project folder and install packages using Yarn:

```bash
cd SSN-DataLake
yarn 
```

* Get your `.env`, `.env_development`, `.env_production`, and `google_bucket_key.json` files from your tech lead and add them to your project folder.

![](https://1974413172-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MW3xxRi0Vf9N8eqeKB5%2F-MW8_G1aPwE5CbD7L-wz%2F-MW8cp6KpyCH9OejUSbp%2FScreen%20Shot%202021-03-19%20at%209.38.50%20AM.png?alt=media\&token=f5ca4f63-c544-4231-9d58-e0843d908d6f)

* Start the application locally:

```bash
yarn start
```

* By default, the application starts on port 5051. Open your browser to `localhost:5051`, and you should see a page like the one below:

![](https://1974413172-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MW3xxRi0Vf9N8eqeKB5%2F-MW4HcNWzQB8vH5yGACq%2F-MW4Wq6j-uGOzX9u8I_-%2FScreen%20Shot%202021-03-18%20at%202.29.52%20PM.png?alt=media\&token=19912ba1-bc87-4f7f-981c-5271fd7b2b1a)

> We use Postman for interacting with and testing the DataLabs API. Go to the PhilanthroLab [team workspace](https://philanthrolab.postman.co/workspace/Team-Workspace~36f68e3a-2d05-4f4b-831f-48d60fd98ea3/documentation/13777544-174fec2d-8630-46b6-909d-d7c78fc2a73e) on Postman to interact with your local instance.

For more developer resources, go [here](https://tech.socialsafety.net/developer-resources/install-elasticsearch-on-gcp).
