Baktaawar
Baktaawar

Reputation: 7500

How to Track Data Difference in Data Pipeline

We have a set of user data that comes in every week. Its in Millions. It consists of bunch of attributes for user (visits, category, product etc) by timestamp.

Now every week, a few 100k records are changed- updated, deleted, added etc. This means same user which had certain record last week, might have got some values changed in some attribute etc. Or new users might have got added completely

however, we still currently run the pipeline that feeds off this data every week for downstream tasks, on the entire 11MM + users in its entirety.

That consumes computation resources and hence cloud cost.

Is there a way/tool that helps in only figuring out which user records got updated from previous run so we can only do downstream processing on those 100k records or so instead of running it for entire dataset every week?

Analogy in code versioning is git diff etc which helps to know incremental patches/changes to a file.

For a data eng ETL process, what could be the right tool/way to do it?

Thanks

Upvotes: 1

Views: 242

Answers (1)

Winson Tanputraman
Winson Tanputraman

Reputation: 3642

A couple of strategies that I know of.

  1. Firstly, last_updated_timestamp. Simply find the records where last_updated_timestamp is greater than your last batch's max(last_updated_timestamp). If you don't have last_updated_timestamp, I suggest start implementing it. This is probably the cheapest method.

  2. Secondly, if there is really still a need to further filter down, you can use a hash of a concatenation of all the fields. Example, MD5(field1 || field 2 || ...). This is only worthwhile if the first method is unreliable, because even if you have the hash, comparing them (essentially joining your existing records with the new batch) is not a cheap operation. Unreliable here means due to external factor, such as user behavior. If it is unreliable because it was not implemented properly, fixing it is still the way to go.

Upvotes: 1

Related Questions