Ajay C
Ajay C

Reputation: 66

Can I load historical data in Upsolver/SQLake at a later time after current data is already ingested

Ideally historical data is loaded first and then current data but we have current data already loaded into snowflake from Kafka as Upsert outputs. We have to later ingest historical data as well and that will be loaded from diff source, lets says from S3 dumps. Can we accomplish this?

Upvotes: 0

Views: 59

Answers (1)

Ajay C
Ajay C

Reputation: 66

You definitely need to create new data source for the historical data. This way historical data is ingested into Upsolver.

Design considerations for next step: If the output was append only (just keep inserting), then your Snowflake output could have used both the historical data source and current data source UNION. You can either add multiple data sources while creating output from UI or you can edit the SQL to add UNION for both data sources and both historical and current data will land into target table. This design would handle Upsert use case also if historical data came first and was fully ingested before current data source started receiving data.

However, in this specific ask, since historical data is arriving later, we can't use this approach as historical data that is arriving in future could Upsert and override current latest data.

Solution 1: if you can stop the current data source till the time historical data is completely processed.

  1. Stop the current data source
  2. Create a look up table on the current data source with the record key
  3. Create a historical output which joins the historical data source and the look up to filter only those historical records which do not exist in the lookup and Upsert it into target
  4. Once historical data is processed completely, stop the historical data source, lookup and historical output.
  5. Restart the current data source

Solution 2: If the current data volume is small and can be reprocessed from beginning then

  1. stop the current snowflake output job
  2. Truncate the snowflake target table
  3. Load (Upsert) the historical data to the target
  4. Once historical data is processed completely, stop the historical data source and output.
  5. Replay the current data snowflake output job (stopped in step 1) from the beginning

Solution 3:

  1. Load historical data into separate history table
  2. Use snowflake compute to identify what should get applied to the main table from the historical table and then discard the historical tables. (this would need snowflake joins to identify latest record keys in historical which aren't present in current table and insert them into current table)

Hope this helps.

Upvotes: 0

Related Questions