
Architecture of Datalabs

The Datalabs API is used for interacting with data in our datalake. The overall design of Datalabs follows two main paradigms:

  • Blob storage for raw data.

  • A key-value store for data metadata.

Blob storage holds raw data uploaded by users in varying formats (CSV only as of now), and as such does not need a specific schema. Metadata, on the other hand, describes the uploaded raw data and must conform to a standard schema. In the next section, we introduce our selected metadata schema and the reasons we chose it.

Schema

The schema for data metadata follows the Frictionless Data Specifications (FDS). FDS is an open set of patterns for describing data, developed by Frictionless Data in conjunction with the Open Knowledge Foundation.

At the core of FDS is a set of patterns for describing data including:

  • Data Package: A simple container format used to describe and package a collection of datasets, or in Frictionless terms, resources.

  • Data Resource: A simple spec describing an individual data file.

  • Table Schema: A simple format to declare a schema for tabular data. The schema is designed to be expressible in JSON.

  • Data Views: A simple format for describing views on data that leverages existing specifications like Vega/Plotly and connects them with data provided in data packages or data resources.
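The first three specs nest naturally: a Data Package wraps one or more Data Resources, and a tabular resource can carry a Table Schema. A minimal illustrative descriptor, written here as a JavaScript object with made-up field values, looks like this:

```javascript
// Illustrative Data Package descriptor: one package, one resource,
// one Table Schema. All values are examples, not real datalake data.
const dataPackage = {
  name: 'example-package',
  resources: [
    {
      name: 'daily-load',
      path: 'data/daily-load.csv',
      format: 'csv',
      schema: {
        // Table Schema: declares the tabular structure in JSON.
        fields: [
          { name: 'date', type: 'date' },
          { name: 'total', type: 'number' },
        ],
      },
    },
  ],
};
```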

Why use FDS?

We decided to use FDS for the following reasons:

  • It is simple.

  • It has a great tooling ecosystem, such as frictionless.js.

  • It can be easily extended and customized.

  • It is built on existing standards and formats.

  • It can be used across a broad range of technologies.

High-Level Data Flow

This section graphically depicts the flow of data from users into our datalake and the processes involved in generating schema.

(Diagram: Datalabs data flow)

Datalabs Metadata schema

In this section, we present a sample schema generated when files are uploaded via a browser or from a URL. Under the hood, we use the frictionless.js library to read such files, and a standard FDS schema is auto-generated.

This auto-generated schema contains the major spec fields but has to be extended to include some of our custom fields.

A final schema with data description looks like the following:

{
    "resources": [
        {
            "path": "https://storage.googleapis.com/social-safety-datalake-prod/Daily Loading Profile 2012-Present-1615671473755/4f684370-e561-46e0-80a3-5c65aa2f617c.csv",
            "pathType": "remote",
            "name": "4f684370-e561-46e0-80a3-5c65aa2f617c",
            "format": "csv",
            "mediatype": "text/csv",
            "encoding": "utf-8",
            "hash": "9d88ea4d873740c74e522e2c9755e8be10be9d703fd901e269983379b9dbdde9"
        }
    ],
    "name": "Daily Loading Profile 2012-Present",
    "title": "Daily Loading Profile 2012-Present",
    "description": "Citywide total daily electric load (usage) measured hourly, presented here as a daily total beginning on January 1, 2012 through last updated date.",
    "access": "public",
    "attribution": "City of Naperville, Illinois",
    "attribution_link": "http://www.naperville.il.us/",
    "version": 0.1,
    "author": "603ce141b458952168bc532e",
    "createdAt": "2021-03-13",
    "updatedAt": "2021-03-13",
    "updatedLastBy": "603ce141b458952168bc532e"
}
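The extension step described above amounts to a plain object merge: the auto-generated descriptor supplies the standard spec fields, and our custom fields are layered on top. The sketch below abbreviates the descriptor and uses made-up values; `autoGenerated` stands in for the real frictionless.js output.

```javascript
// Sketch of extending an auto-generated FDS descriptor with custom
// fields. `autoGenerated` is a stand-in for the frictionless.js
// output; in the real service it comes from reading the file.
const autoGenerated = {
  resources: [
    { name: 'example', format: 'csv', mediatype: 'text/csv', encoding: 'utf-8' },
  ],
  name: 'Daily Loading Profile 2012-Present',
};

const today = new Date().toISOString().slice(0, 10); // "YYYY-MM-DD"

const finalSchema = {
  ...autoGenerated,
  // Custom fields added on top of the standard spec:
  title: autoGenerated.name,
  access: 'public',
  version: 0.1,
  createdAt: today,
  updatedAt: today,
};
```

Because the merge is additive, the descriptor remains a valid FDS document that downstream Frictionless tooling can still read.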

See the schema definition page for more details.

DataLabs API

The Datalabs API is published here.

Technology Stack

We are using the following tech stack for DataLabs API:

  • Programming Language: JavaScript

  • Frameworks: Node.js and Express

  • Object Storage: Elasticsearch

  • Blob Storage: Google Cloud Buckets

  • In-memory Cache: Redis
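One way these pieces typically fit together is a cache-aside read path: check Redis first, fall back to Elasticsearch, then populate the cache. The sketch below is an illustration of the pattern only, with in-memory Maps standing in for both services; none of the names are the real API's.

```javascript
// Cache-aside read sketch. `cache` stands in for Redis and
// `objectStore` for the Elasticsearch metadata index.
const cache = new Map();
const objectStore = new Map([['pkg-1', { name: 'Daily Loading Profile' }]]);

let cacheHits = 0;

function getMetadata(id) {
  if (cache.has(id)) {             // 1. try the in-memory cache first
    cacheHits += 1;
    return cache.get(id);
  }
  const doc = objectStore.get(id); // 2. fall back to the object store
  if (doc) cache.set(id, doc);     // 3. populate the cache for next time
  return doc;
}

getMetadata('pkg-1'); // first read: store lookup, fills the cache
getMetadata('pkg-1'); // second read: served from the cache
```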

Devops and Hosting

Blob Storage

There are two active GCS buckets for raw data storage.

Object Storage

A custom Elasticsearch database is hosted on a GCP VM under the name ssn-elasticsearch-prod.

DataLabs API hosting

The Datalabs API is hosted on Heroku under the pipeline name ssn-datalabs. We have two hosted instances, one each for the staging and production environments.

  • Staging: The staging environment is hosted under the name ssn-datalake-staging. Heroku CI/CD has been configured to automatically run tests and deploy a new version on every push to the dev branch.

  • Production: The production environment is hosted under the name ssn-datalake-production. Heroku CI/CD has been configured to automatically run tests and deploy a new version on every push to the main branch.

Development environment setup

To set up a local development environment, follow the steps below:

  • Ensure you have Node.js (v12 or above), Yarn, and Git installed.

  • Clone the repository:

git clone https://github.com/PhilanthroLab/SSN-DataLake.git
  • Cd into the project folder and install packages using Yarn:

cd SSN-DataLake
yarn 
  • Get your .env, .env_development, .env_production, and google_bucket_key.json files from your tech lead and add them to your project folder.

  • Start the application locally:

yarn start
  • By default, the application starts on port 5051. Open your browser to localhost:5051 to confirm it is running.

We use Postman for interacting with and testing the Datalabs API. Go to the PhilanthroLab team workspace on Postman to interact with your local instance.

For more developer resources, go here.
