Introduction

This document describes the state of the production and staging databases as of January 2023, the challenges encountered, and the steps taken to repopulate the data and make it reproducible.

SSN data was scraped from several websites whose data aligns with the company's mission and vision. The cleaning process for data scraped from one website may differ from the process used for another, so irregularities were both found in the source data and introduced during cleaning and validation.

An inspection of the production and staging databases showed that both contain many data irregularities. These are likely the result of trial-and-error processes that the backend developers may have used while searching for the right data representation.
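One way to make such an inspection repeatable is to profile which property keys appear on the nodes of each label. The sketch below is illustrative only and is not the procedure the team used: it assumes the nodes have already been read into plain dictionaries with `labels` and `props` fields, and the `Service` label and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def profile_property_keys(nodes):
    """For each label, return the share of its nodes that carry each property key."""
    totals = Counter()
    key_counts = defaultdict(Counter)
    for node in nodes:
        for label in node["labels"]:
            totals[label] += 1
            key_counts[label].update(node["props"].keys())
    return {
        label: {key: count / totals[label] for key, count in keys.items()}
        for label, keys in key_counts.items()
    }


def irregular_keys(profile, threshold=1.0):
    """List, per label, the keys present on fewer than `threshold` of its nodes."""
    return {
        label: sorted(key for key, share in keys.items() if share < threshold)
        for label, keys in profile.items()
    }


# Hypothetical sample: the same field spelled two ways by different cleaning runs.
sample = [
    {"labels": ["Service"], "props": {"name": "a", "phone": "1"}},
    {"labels": ["Service"], "props": {"name": "b", "Phone": "2"}},
]
profile = profile_property_keys(sample)
flagged = irregular_keys(profile)
```

Keys that appear on only part of a label's nodes, such as two spellings of the same field, are good candidates for a closer manual look.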

At the time of writing, there is no clear documentation of how the data was cleaned or processed, which has been a major obstacle to data reproducibility.

In addition, the version of both databases does not allow a total data extraction using the Neo4j APOC plugin. This is another major obstacle to exporting all the data for preprocessing and cleaning.

Summary:

  1. Data irregularities in both databases

  2. No documentation of how the data was processed, which prevents reproducibility

  3. Incompatibility of the APOC plugin with the databases' version, which prevents a total data export


