Reputation: 30809
We are using the following code to write the records to BigQuery:
BigQueryIO.writeTableRows()
    .to("table")  // destination table
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)  // create the table if it does not exist
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)        // always append new rows
    .withSchema(schema);
With this code, when we do a backfill, some of the records are sent through this Dataflow pipeline again, resulting in duplicates in the BigQuery table. Is there any way to configure an upsert operation in Dataflow based on a key field?
Upvotes: 4
Views: 4475
Reputation: 1864
I had a very similar use case. We decided to solve it by creating a view on top of the table that deduplicates the data, and pointing anything that was querying the original table at this view instead.
"BigQuery deduplication and partitioned table" is a very good reference, as is https://wecode.wepay.com/posts/bigquery-wepay
Essentially, you need to make sure you have some sort of last_updated column, as well as an id column that uniquely identifies each row. Then you can create a view that returns only the most recent version of each id. There may be a small performance loss because you are querying a view, but we found it to be negligible.
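A rough sketch of such a view in BigQuery Standard SQL, keeping only the newest row per id (the project, dataset, table, and the id / last_updated column names are placeholders to replace with your own):
-- Keep only the most recent version of each id (placeholder names throughout)
CREATE OR REPLACE VIEW `my_project.my_dataset.my_table_deduped` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY last_updated DESC) AS rn
  FROM `my_project.my_dataset.my_table`
)
WHERE rn = 1;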
Upvotes: 7
Reputation: 710
Is this still a gap with the BigQueryIO writer? It seems like a major limitation: most workloads load more data each night, but on any given night you may need to re-run that set. You never want to truncate the entire table, but truncating just that run's set of keys and then appending would be fine, as opposed to an actual update. So ideally there would be something like a TRUNCATE_KEY_SET disposition or a normal UPDATE....
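That "truncate just this run's keys" idea isn't a BigQueryIO feature, but as a sketch it can be done with BigQuery DML before re-running the pipeline with WRITE_APPEND (all table and column names below are placeholders):
-- Remove this run's keys before appending them again (placeholder names)
DELETE FROM `my_project.my_dataset.my_table`
WHERE id IN (
  SELECT id
  FROM `my_project.my_dataset.backfill_keys`  -- table holding the ids for this run
);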
Upvotes: 2