Introduction

This documentation describes the state of the production and staging databases as of January 2023, the challenges encountered, and the steps taken toward data repopulation and reproducibility.

SSN data was scraped from different websites whose data is in line with the mission and vision of the company. As a result, some irregularities were already present in the data, and others were introduced during cleaning and validation. The cleaning process for data scraped from website A might differ from the cleaning process for another website, and this is how the irregularities were introduced.

From the data inspection carried out on the production and staging databases, we found that both contain many irregularities, likely because of the trial-and-error processes the backend developers may have gone through while looking for the right data representation.
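One way to make this kind of inspection repeatable is to count, for each node label, how many distinct sets of property keys occur. The sketch below is only an illustration under stated assumptions: the record layout (`label` plus a `properties` dictionary), the function names, and the sample values are all hypothetical and are not taken from the actual databases.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Mapping, Tuple

# Hypothetical record layout: {"label": str, "properties": dict}.
Record = Mapping[str, object]
Shapes = Dict[str, Counter]


def profile_property_keys(records: Iterable[Record]) -> Shapes:
    """Count, per label, how often each distinct set of property keys occurs."""
    shapes: Shapes = defaultdict(Counter)
    for record in records:
        keys: Tuple[str, ...] = tuple(sorted(record["properties"]))
        shapes[record["label"]][keys] += 1
    return shapes


def irregular_labels(shapes: Shapes) -> List[str]:
    """Return the labels whose nodes do not all share the same property keys."""
    return sorted(label for label, counts in shapes.items() if len(counts) > 1)


# Made-up sample: two sources cleaned differently ("website" vs "url").
sample = [
    {"label": "Organization", "properties": {"name": "A", "website": "a.example"}},
    {"label": "Organization", "properties": {"name": "B", "url": "b.example"}},
    {"label": "Person", "properties": {"name": "C"}},
]

flagged = irregular_labels(profile_property_keys(sample))
```

Here `flagged` lists the labels worth a closer look; a label that appears in it has at least two different property layouts, which is the kind of irregularity described above.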

At the time of writing, there is no clear documentation on how the data was cleaned or processed, which has been a major obstacle to data reproducibility.

Also, the version of both databases does not allow total data extraction using the Neo4j APOC plugin. This is another major obstacle to a total data export for preprocessing and cleaning.
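APOC releases are tied to Neo4j releases, so a quick first check is whether the major and minor numbers of the two versions agree. The sketch below assumes the version strings have already been read from the server, for example with the two Cypher statements kept as constants in the code. The helper names and the version values in the usage lines are hypothetical, since this page does not state which versions are deployed.

```python
from typing import Tuple

# Cypher statements that report the two versions on a running server.
NEO4J_VERSION_QUERY = "CALL dbms.components() YIELD name, versions RETURN name, versions"
APOC_VERSION_QUERY = "RETURN apoc.version() AS version"


def major_minor(version: str) -> Tuple[int, int]:
    """Return the (major, minor) pair from a version string such as '4.4.0.12'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def apoc_matches_neo4j(neo4j_version: str, apoc_version: str) -> bool:
    """Check that APOC and Neo4j agree on their major and minor version numbers."""
    return major_minor(neo4j_version) == major_minor(apoc_version)


# Hypothetical values, for illustration only.
compatible = apoc_matches_neo4j("4.4.11", "4.4.0.12")
mismatched = apoc_matches_neo4j("3.5.14", "4.4.0.12")
```

A matching major and minor number is necessary but not sufficient: the export procedures also need to be enabled in the server configuration, so this check only rules out the most obvious cause.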

Summary:

Data irregularities in both databases
No documentation on data processing for reproducibility
Incompatibility of the APOC plugin with the databases' version for total data export


