Architecture

Architecture of Datalabs

The Datalabs API is used for interacting with data in our datalake. The overall design of Datalabs follows two main paradigms:

  • A blob storage for raw data.

  • A key-value store for data metadata.

Blob storage is used for storing raw data uploaded by users in varying formats (CSV only as of now), and as such does not require a specific schema. Metadata, on the other hand, is a description of the raw data uploaded, and as such must conform to a standard schema. In the next section, we introduce our selected schema for metadata and the reasons why we chose it.
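
As a minimal sketch of this split (not the actual DataLabs implementation; the index name and upload flow are illustrative, and error handling is omitted), an upload might land in both stores like this:

const { Storage } = require('@google-cloud/storage');
const { Client } = require('@elastic/elasticsearch');

const storage = new Storage({ keyFilename: 'google_bucket_key.json' });
const es = new Client({ node: process.env.ELASTICSEARCH_URL });

const BUCKET = 'social-safety-datalake-staging';

async function storeDataset(localPath, metadata) {
  // 1. Blob storage: upload the raw CSV, schema-free, to the GCS bucket.
  const [file] = await storage.bucket(BUCKET).upload(localPath);

  // 2. Key-value store: index the metadata document that describes the blob.
  await es.index({
    index: 'datasets', // hypothetical index name
    id: metadata.name,
    body: {
      ...metadata,
      path: `https://storage.googleapis.com/${BUCKET}/${file.name}`,
    },
  });
}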

Schema

The schema for data metadata follows the Frictionless Data Specifications (FDS). FDS is an open set of patterns for describing data, developed by Frictionless Data in conjunction with the Open Knowledge Foundation.

At the core of FDS is a set of patterns for describing data, including:

  • Data Package: A simple container format used to describe and package a collection of datasets or, in Frictionless terms, resources.

  • Data Resource: A single spec describing an individual data file.

  • Table Schema: A simple format for declaring a schema for tabular data. The schema is designed to be expressible in JSON.

  • Data Views: A simple format for describing views on data that leverages existing specifications like Vega/Plotly and connects them with data provided in data packages or data resources.
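
For illustration, a minimal (hypothetical) Data Package descriptor tying the first three patterns together might look like this, with the fields block following the Table Schema pattern:

{
  "name": "example-package",
  "resources": [
    {
      "name": "daily-load",
      "path": "daily-load.csv",
      "format": "csv",
      "schema": {
        "fields": [
          { "name": "date", "type": "date" },
          { "name": "total_load_kwh", "type": "number" }
        ]
      }
    }
  ]
}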

Why use FDS?

We decided to use FDS for the following reasons:

  • It is simple

  • It has a great tooling ecosystem, including frictionless.js

  • It can be easily extended and customized

  • It was built on existing standards and formats

  • It can be used across a broad range of technologies

High-Level Data Flow

This section graphically depicts the flow of data from users into our datalake and the processes involved in generating the schema.

[Figure: Datalab data flow]

Datalabs Metadata schema

In this section, we present a sample schema that is generated when files are uploaded via a browser or from a URL. Under the hood, we use the frictionless.js library to read such files, and a standard FDS schema is auto-generated. This auto-generated schema contains the major spec fields but has to be extended to include some of our custom fields.
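
A rough sketch of that step, assuming the open()/addSchema() API from the frictionless.js README (the custom fields added at the end are illustrative):

const { open } = require('frictionless.js');

async function buildMetadata(path, userId) {
  const file = open(path); // handles local paths and remote URLs
  await file.addSchema();  // infers a Table Schema by sampling the file

  // Start from the auto-generated spec fields, then add our custom ones.
  return {
    resources: [file.descriptor],
    name: file.descriptor.name,
    access: 'public',  // custom field
    author: userId,    // custom field
    createdAt: new Date().toISOString().slice(0, 10),
  };
}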

A final schema with data description looks like the following:

{
  "resources": [
    {
      "path": "https://storage.googleapis.com/social-safety-datalake-prod/Daily Loading Profile 2012-Present-1615671473755/4f684370-e561-46e0-80a3-5c65aa2f617c.csv",
      "pathType": "remote",
      "name": "4f684370-e561-46e0-80a3-5c65aa2f617c",
      "format": "csv",
      "mediatype": "text/csv",
      "encoding": "utf-8",
      "hash": "9d88ea4d873740c74e522e2c9755e8be10be9d703fd901e269983379b9dbdde9"
    }
  ],
  "name": "Daily Loading Profile 2012-Present",
  "title": "Daily Loading Profile 2012-Present",
  "description": "Citywide total daily electric load (usage) measured hourly, presented here as a daily total beginning on January 1, 2012 through last updated date.",
  "access": "public",
  "attribution": "City of Naperville, Illinois",
  "attribution_link": "http://www.naperville.il.us/",
  "version": 0.1,
  "author": "603ce141b458952168bc532e",
  "createdAt": "2021-03-13",
  "updatedAt": "2021-03-13",
  "updatedLastBy": "603ce141b458952168bc532e"
}

See the schema definition page for more details.
DataLabs API

The DataLabs API is published here.

Technology Stack

We are using the following tech stack for the DataLabs API:

  • Programming Language: JavaScript

  • Frameworks: Node.js and Express

  • Object Storage: Elasticsearch

  • Blob Storage: Google Cloud Storage buckets

  • In-memory Cache: Redis
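
As a rough sketch of how these pieces fit together (hypothetical route and index names, not the actual DataLabs code), an Express route could serve metadata from the Redis cache and fall back to Elasticsearch:

const express = require('express');
const { Client } = require('@elastic/elasticsearch');
const { createClient } = require('redis');

const app = express();
const es = new Client({ node: process.env.ELASTICSEARCH_URL });
const cache = createClient({ url: process.env.REDIS_URL });
cache.connect(); // node-redis v4 clients must connect before use

app.get('/datasets/:name', async (req, res) => {
  const key = `dataset:${req.params.name}`;

  // Serve from the in-memory cache when we can.
  const hit = await cache.get(key);
  if (hit) return res.json(JSON.parse(hit));

  // Otherwise read from Elasticsearch and cache the document for an hour.
  const { body } = await es.get({ index: 'datasets', id: req.params.name });
  await cache.set(key, JSON.stringify(body._source), { EX: 3600 });
  res.json(body._source);
});

app.listen(5051); // the port the local setup below uses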

DevOps and Hosting

Blob Storage

There are two active GCS buckets for raw data storage. These are:

  • Production blob bucket under the name social-safety-datalake-prod, and

  • Staging blob bucket under the name social-safety-datalake-staging.

Object Storage

A custom Elasticsearch database is hosted on a GCP VM under the name ssn-elasticsearch-prod.

DataLabs API hosting

The DataLabs API is hosted on Heroku under the pipeline name ssn-datalabs. We have two hosted instances, one each for the staging and production environments.

Staging: The staging environment is hosted under the name ssn-datalake-staging. Heroku CI/CD has been configured to automatically run tests and deploy a new version on every push to the dev branch.

Production: The production environment is hosted under the name ssn-datalake-production. Heroku CI/CD has been configured to automatically run tests and deploy a new version on every push to the main branch.

Development environment setup

To set up a local development environment, follow the steps below:

  • Ensure you have Node.js (v12 or above), Yarn, and Git installed.

  • Clone the repository:

git clone https://github.com/PhilanthroLab/SSN-DataLake.git
  • cd into the project folder and install the packages using Yarn:

cd SSN-DataLake
yarn
  • Get your .env, .env_development, .env_production, and google_bucket_key.json files from your tech lead and add them to your project folder.

  • Start the application locally:

yarn start
  • By default, your application starts on port 5051. Open your browser to localhost:5051 and you should see the DataLabs landing page.
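
As a quick alternative to the browser, you can smoke-test your local instance from a terminal (the root route is an assumption; adjust the path to a known endpoint if needed):

curl -i http://localhost:5051/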

We use Postman for interacting with and testing the DataLabs API. Go to the Philanthrolab team workspace on Postman to interact with your local instance.

For more developer resources, go here.