Architecture of Datalabs
The Datalabs API is used for interacting with data in our datalake. The overall design of Datalabs follows two main paradigms:
A blob storage for raw data.
A key-value store for metadata describing that data.
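The split between the two stores can be sketched as below. This is only an illustration of the design, not the actual implementation: in-memory Maps stand in for the blob store (GCS) and the key-value metadata store, and the function and dataset names are hypothetical.

```javascript
// Illustrative sketch of the two-store design.
// Raw data goes into a blob store with no schema imposed;
// metadata goes into a key-value store and must follow the standard schema.
const blobStore = new Map();     // name -> raw file contents (stand-in for GCS)
const metadataStore = new Map(); // name -> metadata document (stand-in for the metadata store)

function storeDataset(name, rawCsv, metadata) {
  blobStore.set(name, rawCsv);       // raw data: stored as-is, any format
  metadataStore.set(name, metadata); // metadata: conforms to the standard schema
}

function getDataset(name) {
  return { raw: blobStore.get(name), metadata: metadataStore.get(name) };
}

storeDataset('donations.csv', 'id,amount\n1,50\n', { name: 'donations', format: 'csv' });
```

The key point is that only the metadata side carries schema obligations; the blob side simply preserves whatever the user uploaded.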
Blob storage is used for storing raw data uploaded by users in varying formats (CSV only as of now), and as such does not need a specific schema. Metadata, on the other hand, is a description of the raw data uploaded, and as such must conform to a standard schema. In the next section, we introduce our selected schema for metadata and the reasons why we chose it.
The schema for data metadata follows the Frictionless Data Specifications (FDS). FDS is an open set of patterns for describing data developed by the Open Knowledge Foundation in conjunction with its community.
At the core of FDS is a set of patterns for describing data including:
Data Package: A simple container format used to describe and package a collection of data files (in Frictionless terms, resources).
Data Resource: A single spec describing an individual data file.
Table Schema: A simple format to declare a schema for tabular data. The schema is designed to be expressible in JSON.
Data Views: A simple format for describing views on data that leverages existing specifications like Vega and Plotly and connects them with data provided in data packages or data resources.
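To make these patterns concrete, here is a minimal Data Package descriptor sketched as a JavaScript object. The shape follows the FDS specs (a package containing resources, each with a Table Schema); the dataset name, file name, and fields are invented for illustration.

```javascript
// Example FDS Data Package descriptor (field values are hypothetical).
const descriptor = {
  name: 'example-dataset',
  resources: [                  // Data Resource: one entry per data file
    {
      name: 'donations',
      path: 'donations.csv',
      format: 'csv',
      schema: {                 // Table Schema for the tabular file
        fields: [
          { name: 'id', type: 'integer' },
          { name: 'amount', type: 'number' },
          { name: 'donated_at', type: 'date' }
        ]
      }
    }
  ]
};
console.log(JSON.stringify(descriptor, null, 2));
```

Because the descriptor is plain JSON, it is easy to store as a metadata document, extend with custom fields, and validate with existing tooling.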
We decided to use FDS for the following reasons:
It is simple
It has a great tooling ecosystem.
It can be easily extended and customized
It was built on existing standards and formats
It can be used across a broad range of technologies
This section graphically depicts the flow of data from users into our datalake and the processes involved in generating the metadata schema.
This auto-generated schema contains major spec fields but has to be extended to include some of our custom fields.
A final schema with data description looks like the following:
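As a rough illustration of that extension step, the sketch below takes an auto-generated FDS schema and spreads our own fields on top of it. The custom field names here (`description`, `uploadedAt`) are invented for the example; the actual DataLabs custom fields may differ.

```javascript
// Hypothetical illustration of extending an auto-generated FDS schema
// with custom fields (these custom field names are invented, not the
// actual DataLabs fields).
const autoGenerated = {
  name: 'donations',
  path: 'donations.csv',
  schema: {
    fields: [
      { name: 'id', type: 'integer' },
      { name: 'amount', type: 'number' }
    ]
  }
};

const extended = {
  ...autoGenerated,                             // keep everything FDS produced
  description: 'Donations uploaded by a user',  // custom: human-supplied description
  uploadedAt: new Date().toISOString()          // custom: upload timestamp
};
```

Spreading the generated descriptor keeps the standard FDS fields intact, so the result still validates as a Frictionless descriptor while carrying our extra metadata.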
We are using the following tech stack for DataLabs API:
Programming Language: JavaScript
Runtime and Framework: Node.js and Express
Metadata Store: Elasticsearch
Blob Storage: Google Cloud Storage buckets
In-memory Cache: Redis
There are two active GCS buckets for raw data storage. These are:
For developers, to set up a local environment, follow the steps below:
Ensure you have Node.js (v12 or above), Yarn, and Git installed.
Clone the repository:
cd into the project folder and install packages using Yarn.
Get the .env, .env_development, .env_production, and google_bucket_key.json files from your tech lead and add them to your project folder.
Start the application locally:
By default, your application starts on port 5051. Open your browser to localhost:5051, and you should see a page like the one below:
In this section, we present a sample schema which is generated when files are uploaded via a browser or from a URL. Under the hood, we use the library to read such files, and a standard FDS schema is auto-generated.
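The inference step can be sketched as follows. The library the API actually uses is not named above, so this is a hand-rolled, illustrative version that looks at a CSV header row and one sample row to guess Table Schema field types.

```javascript
// Rough sketch of auto-generating a Table Schema from CSV text.
// Only illustrative: real FDS tooling does far more robust type inference.
function inferSchema(csvText) {
  const [headerLine, ...rows] = csvText.trim().split('\n');
  const headers = headerLine.split(',');
  const sample = rows[0] ? rows[0].split(',') : [];
  const fields = headers.map((name, i) => {
    const value = sample[i];
    let type = 'string'; // default when the value is missing or non-numeric
    if (value !== undefined && value !== '' && !isNaN(Number(value))) {
      type = Number.isInteger(Number(value)) ? 'integer' : 'number';
    }
    return { name, type };
  });
  return { fields };
}

const schema = inferSchema('id,amount,city\n1,49.5,Lagos\n');
// schema.fields -> [{name:'id',type:'integer'},{name:'amount',type:'number'},{name:'city',type:'string'}]
```

The result has the same `fields` shape as a Table Schema, which is what allows the generated descriptor to be extended with custom fields and stored as metadata.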
See for more details
The Datalabs API is published with:
a production blob bucket under the name , and
a staging blob bucket under the name .
A custom Elasticsearch database is hosted on a GCP VM under the name .
The Datalabs API is hosted on Heroku under the pipeline name . We have two hosted instances, one each for the staging and production environments.
Staging: The staging environment is hosted under the name . Heroku CI/CD has been configured to automatically run tests and deploy a new version on every push to branch.
Production: The production environment is hosted under the name . Heroku CI/CD has been configured to automatically run tests and deploy a new version on every push to branch.
We use Postman for interacting with and testing the DataLabs API. Go to the Philanthrolab on Postman to interact with your local instance.
For more developer resources go