ETL Checklist

This outlines the ETL steps for each source and when each step was last performed.


Last updated 4 days ago


| Source | Map Param | Mapped URLs | Extract Data for providers | Enrich provider info using Web search | Compare with existing data for provider | Merge and Update graph with new information | Last performed |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 211wny.org | provider | | | | | | |
| 211texas.org (tx211tirn.communityos.org) | resource-public/render/id | | | | | | |
| hitesite.org | resource | | | | | | |
| findhelp.org | provider | | | | | | |

Mapping URLs

This step uses the Firecrawl map function to collect every URL on the source that carries provider info.

The crawler can take a search param so that the map function limits its response to relevant URLs. Use the param listed in the table for each source when repeating the mapping in future ETL runs.

You can find the Firecrawl API docs for the various SDKs, or use the online playground to run the mapping.

The code for all the steps is provided in the SSN-ETL-SCRAPING repo, which can be found in the Philanthrolab GitHub organization.
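As a rough illustration of how the map param narrows the result, the sketch below post-filters a list of mapped URLs locally. It is not the code from the SSN-ETL-SCRAPING repo: `MAP_PARAMS` and `filter_mapped_urls` are hypothetical names, the example URLs are made up, and keying the 211texas.org entry on `tx211tirn.communityos.org` is an assumption based on the host shown in the table.

```python
from urllib.parse import urlparse

# Source host -> map param, copied from the checklist table.
# Assumption: 211texas.org is crawled via tx211tirn.communityos.org.
MAP_PARAMS = {
    "211wny.org": "provider",
    "tx211tirn.communityos.org": "resource-public/render/id",
    "hitesite.org": "resource",
    "findhelp.org": "provider",
}


def filter_mapped_urls(urls, source):
    """Keep de-duplicated URLs on `source` whose path contains its map param."""
    param = MAP_PARAMS[source]
    seen, kept = set(), []
    for url in urls:
        parsed = urlparse(url)
        host = parsed.netloc.lower()
        # Accept the bare host and any subdomain such as www.
        on_source = host == source or host.endswith("." + source)
        if on_source and param in parsed.path and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


# Illustrative URLs only; real ones come from the map step.
mapped = [
    "https://www.findhelp.org/provider/abc",
    "https://www.findhelp.org/about",
    "https://www.findhelp.org/provider/abc",
    "https://other.org/provider/x",
]
relevant = filter_mapped_urls(mapped, "findhelp.org")
```

A local filter like this can act as a safety net even when the search param is passed to the map call, since it guarantees that only URLs matching the table entry reach the extract step.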

Extracting data from mapped URLs

This step uses the Firecrawl extract function with a prompt to pull the provider data from the links mapped in the previous step.

Update the prompt as needed to achieve the required result.
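One way to keep the prompt easy to update is to hold it in a single constant and build the extract requests from it. The sketch below is an assumption-heavy illustration: the prompt wording, the field list in `PROVIDER_SCHEMA`, the batch size, and the helper name `build_extract_requests` are all hypothetical, and the `urls`, `prompt`, and `schema` keys reflect what the extract call is assumed to accept, so check them against the API docs before use.

```python
# Hypothetical prompt; adjust the wording to achieve the required result.
EXTRACT_PROMPT = (
    "Extract the provider's name, address, phone, website and services "
    "from this page. Leave a field empty if the page does not state it."
)

# Hypothetical field list, not the project's actual schema.
PROVIDER_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "address": {"type": "string"},
        "phone": {"type": "string"},
        "website": {"type": "string"},
        "services": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name"],
}


def build_extract_requests(urls, prompt=EXTRACT_PROMPT, batch_size=10):
    """Split the mapped URLs into batches and build one request body per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        {
            "urls": urls[start:start + batch_size],
            "prompt": prompt,
            "schema": PROVIDER_SCHEMA,
        }
        for start in range(0, len(urls), batch_size)
    ]


# Illustrative URLs only.
example_urls = [f"https://www.findhelp.org/provider/{n}" for n in range(25)]
requests_to_send = build_extract_requests(example_urls)
```

Batching keeps each request small, so a failed call only needs to be retried for its own batch of links.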

Enriching data

The Firecrawl search function looks up each organization by name to find additional information on the web. Together with the extract output, these results are compared with the existing data to fill gaps left by the previous step.

To capture as much data as is available, an OpenAI model then browses the web to fill in the fields that still lack information. An AI agent performs this operation.
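The gap-filling idea can be sketched without any web access: find the fields a record still lacks, then copy in only those values from the enrichment result. The field names and the helpers `missing_fields` and `fill_gaps` are hypothetical, and this is a plain rule-based illustration rather than the agent itself.

```python
def is_blank(value):
    """Treat None, empty strings and empty lists as missing information."""
    return value is None or value == "" or value == []


def missing_fields(record, fields):
    """Return the fields in `fields` that the record has no information for."""
    return [field for field in fields if is_blank(record.get(field))]


def fill_gaps(record, enrichment, fields):
    """Copy enrichment values into the record, but only for missing fields."""
    filled = dict(record)  # leave the caller's record untouched
    for field in missing_fields(record, fields):
        if not is_blank(enrichment.get(field)):
            filled[field] = enrichment[field]
    return filled


# Hypothetical field list and made-up example values.
FIELDS = ["name", "phone", "website", "email"]
extracted = {"name": "Food Bank", "phone": "555-0100", "website": ""}
enrichment = {"phone": "555-0199", "website": "https://example.org"}

gaps = missing_fields(extracted, FIELDS)
enriched = fill_gaps(extracted, enrichment, FIELDS)
```

Values that were already extracted are never overwritten here; deciding whether an existing value is obsolete belongs to the comparison step.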

Compare with existing data

The newly collected data is compared with the existing data, and a next step is suggested for it: merge or replace. This tells us which existing data is obsolete and which can be kept.

The AI agent also handles this step: it does the comparison and takes the appropriate action.
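To make the merge-or-replace decision concrete, here is a deterministic, rule-based stand-in for the comparison. It is not the AI agent described above; the rules, the action labels, and the function name `compare_records` are assumptions chosen for illustration.

```python
def is_blank(value):
    """Treat None, empty strings and empty lists as missing information."""
    return value is None or value == "" or value == []


def normalize(value):
    """Compare strings case-insensitively and ignoring surrounding spaces."""
    return value.strip().lower() if isinstance(value, str) else value


def compare_records(existing, incoming):
    """Suggest an action per field: 'keep', 'merge' or 'replace'."""
    actions = {}
    for field in sorted(set(existing) | set(incoming)):
        old, new = existing.get(field), incoming.get(field)
        if is_blank(new):
            actions[field] = "keep"      # nothing new to apply
        elif is_blank(old):
            actions[field] = "merge"     # new value fills a gap
        elif normalize(old) == normalize(new):
            actions[field] = "keep"      # same information
        else:
            actions[field] = "replace"   # existing value looks obsolete
    return actions


# Made-up example values.
existing = {"name": "Food Bank", "phone": "555-0100", "website": ""}
incoming = {
    "name": "food bank ",
    "phone": "555-0199",
    "website": "https://example.org",
    "email": None,
}
suggested = compare_records(existing, incoming)
```

A fixed rule set like this cannot judge which of two conflicting values is more trustworthy, which is the kind of call the agent is meant to make.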

Merge/Replace

The data is merged or replaced according to the directive from the orchestrator in the previous step. Merges use the Neo4j merge feature, while data that is to be replaced is overwritten.
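A minimal sketch of the two write paths, assuming providers are stored as `Provider` nodes keyed by `name` (both the label and the key are hypothetical, as is the helper `build_upsert`). In Cypher, `SET p += $props` adds or updates the given properties and leaves the others in place, while `SET p = $props` replaces the node's whole property set.

```python
# Assumed label and key; adapt to the actual graph schema.
MERGE_QUERY = "MERGE (p:Provider {name: $name}) SET p += $props"
REPLACE_QUERY = "MERGE (p:Provider {name: $name}) SET p = $props"


def build_upsert(name, props, action):
    """Return the Cypher query and parameters for a 'merge' or 'replace'."""
    if action not in ("merge", "replace"):
        raise ValueError(f"unknown action: {action!r}")
    # Always carry the key, so a replace does not drop the name property.
    params = {"name": name, "props": {**props, "name": name}}
    query = MERGE_QUERY if action == "merge" else REPLACE_QUERY
    return query, params


# Made-up example values.
query, params = build_upsert("Food Bank", {"phone": "555-0199"}, "merge")
```

With the official Neo4j Python driver, the pair could then be passed to a session as `session.run(query, params)`.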

firecrawl.dev
API docs