ETL Checklist
This outlines the ETL steps for each source and when ETL was last performed.
| Source | Search param |
| --- | --- |
| 211wny.org | provider |
| 211texas.org (tx211tirn.communityos.org) | resource-public/render/id |
| hitesite.org | resource |
| findhelp.org | provider |
This step uses the Firecrawl map function to collect all the URLs containing provider info for a source.
The crawler accepts a search param to assist the mapping, so the map function limits its response to only the relevant URLs. Use the param from the table above when replicating future ETL mapping.
You can run the mapping through any of the SDKs or via the online playground.
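The mapping step above can be sketched as follows. This is an illustrative sketch, not the exact pipeline code: the `filter_relevant` helper shows the effect of the search param as a local filter, and the commented SDK call assumes the `firecrawl-py` package and a placeholder API key.

```python
def filter_relevant(urls, param):
    """Keep only mapped URLs whose path contains the source's search param."""
    return [u for u in urls if param in u]

# Example using the findhelp.org param from the table above:
urls = [
    "https://www.findhelp.org/provider/food-bank--buffalo-ny",
    "https://www.findhelp.org/about",
]
provider_urls = filter_relevant(urls, "provider")
# provider_urls -> ["https://www.findhelp.org/provider/food-bank--buffalo-ny"]

# Untested sketch of the SDK call itself:
# from firecrawl import FirecrawlApp
# app = FirecrawlApp(api_key="fc-...")  # placeholder key
# result = app.map_url("https://www.findhelp.org", params={"search": "provider"})
```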
This step uses the Firecrawl extract function with a prompt to pull the data from the links mapped in the previous step.
Update the prompt as needed to achieve the required result.
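A minimal sketch of how the extraction prompt can be composed and updated. The field list and prompt wording are assumptions, not the exact prompt used in the pipeline; the commented call shows the general shape of an extract request.

```python
# Illustrative provider fields; adjust to whatever the schema requires.
FIELDS = ["name", "address", "phone", "website", "services", "hours"]

def build_extract_prompt(fields):
    """Compose a prompt asking for the provider fields as JSON."""
    return (
        "Extract the following provider fields from the page and "
        "return them as JSON, using null for anything missing: "
        + ", ".join(fields)
    )

prompt = build_extract_prompt(FIELDS)

# Untested sketch of the SDK call:
# from firecrawl import FirecrawlApp
# app = FirecrawlApp(api_key="fc-...")  # placeholder key
# data = app.extract(mapped_urls, {"prompt": prompt})
```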
Using the Firecrawl search function, the organization name is used to find additional information on the internet. Combined with the extract functionality, this yields information that is compared with the existing data to fill the gaps left by the previous step.
To capture as much data as possible, an OpenAI model is then used to browse the internet and fill the gaps in the fields we lack information on. An AI agent performs this operation.
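The gap-filling described above can be sketched as a simple rule: fields the extract step left empty are filled from the search/browse results, while existing values are kept. The function and field names here are illustrative, not the agent's actual code.

```python
def fill_gaps(existing, found):
    """Return a copy of `existing` with empty fields filled from `found`."""
    merged = dict(existing)
    for key, value in found.items():
        # Only fill fields that are missing or empty; never clobber data.
        if merged.get(key) in (None, "", []):
            merged[key] = value
    return merged

record = {"name": "Food Bank of WNY", "phone": None, "website": ""}
from_search = {"name": "ignored", "phone": "716-555-0100", "website": "https://example.org"}
filled = fill_gaps(record, from_search)
# filled["phone"] -> "716-555-0100"; filled["name"] is unchanged
```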
The retrieved data is compared with the existing data, and a next step is suggested: merge or replace. This tells us which data is obsolete and which we can keep.
This comparison is also a function of the AI agent, which takes the appropriate action.
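One way the merge-vs-replace decision could work is sketched below. This is an illustrative heuristic, not the agent's actual logic: if the newly found record mostly contradicts the fields we already hold, replace; if it mostly agrees or only adds new fields, merge.

```python
def decide(existing, new):
    """Return 'replace' when most overlapping fields conflict, else 'merge'."""
    # Only compare fields present in both records with non-empty values.
    overlap = [k for k in new if k in existing and existing[k] not in (None, "")]
    if not overlap:
        return "merge"  # nothing contradicted; the new data only adds fields
    conflicts = sum(1 for k in overlap if existing[k] != new[k])
    return "replace" if conflicts > len(overlap) / 2 else "merge"

# The new record contradicts the only shared field -> replace:
# decide({"phone": "716-555-0100"}, {"phone": "716-555-0199"}) -> "replace"
# The new record agrees and only adds a field -> merge:
# decide({"phone": "716-555-0100"}, {"phone": "716-555-0100", "web": "x"}) -> "merge"
```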
The data is merged or replaced following the directive from the orchestrator in the previous step. For merging, Neo4j's MERGE feature is used; data marked for replacement is overwritten.
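The load step can be sketched with Cypher: `MERGE ... ON MATCH SET n += $props` folds new fields into an existing node, while a plain `SET n = $props` overwrites the node's properties entirely. The `Provider` label, `source_url` key, and driver usage below are assumptions for illustration.

```python
def build_load_query(action):
    """Build a Cypher statement for the orchestrator's merge/replace directive."""
    if action == "merge":
        return (
            "MERGE (p:Provider {source_url: $url}) "
            "ON CREATE SET p = $props "
            "ON MATCH SET p += $props"  # += keeps existing props, adds/updates new ones
        )
    # Replace: overwrite all properties with the new record.
    return "MERGE (p:Provider {source_url: $url}) SET p = $props"

# Untested sketch of running it with the official neo4j driver:
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "..."))
# with driver.session() as session:
#     session.run(build_load_query("merge"), url=record_url, props=record)
```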