Reputation: 235
We are architecting out our data warehousing solutions from Datastore data sources. We would like to load the new inserted/updated/deleted datastore entities within a pre-defined time interval into BigQuery for analytics.
There seems to be several options:
Do backup of whole kind and load data into BigQuery. And Dedup in BigQuery. This is very simple to do conceptually, but loading/process all the data every time seems very inefficient to me.
Publish all the new/updated/deleted entities when the operations are performed and have Dataflow subscribe and TL to BigQuery.
Have a last modified timestamp on the entity, and pull out only those entities that were modified in the specified timeframe. We would like to take this option, but the deleted records seems to be problem, do we have to implement a soft delete?
Any recommendations on the best practice?
Upvotes: 0
Views: 1195
Reputation: 3657
Your #1 is quite wasteful and inefficient. You have to export all the data, not just the changed delta you care about. The backup + load process creates intermediate files in GCS and is somewhat slow, and the loading also comes with limitations.
Option #2 is doable, but it needs more infrastructure. More points of failure.
Option #3 is best I think. Like you already mentioned, a soft delete would help -- you don't need to actually remove the data, just adding an active/inactive
flag or deleted_at
timestamp would do. Also an updated_at
or modified_at
is necessary for you to make the ETL incremental.
Upvotes: 0
Reputation: 1099
There is another option that we have implemented :)
You do BQ streaming insert of all operations (and better to use insert time based partitions) and after this - if needed you can produce consolidated tables (where you have single instance of each records - so you are taking in account update/delete properly) regularly.
What I found interesting that this table with all (just streamed) - non consolidated data - could give few quite interesting insights - like update/delete patterns - which disappear when you consolidate.
Upvotes: 1