Reputation: 321
I have a GCP Dataflow pipeline configured with a SQL SELECT query that pulls specific rows from a Postgres table and inserts them into a BigQuery dataset. The pipeline is configured to run daily at 12am UTC.
When the pipeline initiates a job, it runs successfully and copies the desired rows. However, when the next job runs, it copies the same set of rows into the BigQuery table again, resulting in data duplication.
I wanted to know if there is a way to truncate the BigQuery table before the pipeline runs. It seems like a common problem, so I'm hoping there is an easy solution that doesn't require a custom Dataflow template.
Upvotes: 3
Views: 905
Reputation: 1428
BigQueryIO has a WriteDisposition option, which you can set to WRITE_TRUNCATE.
According to the BigQueryIO documentation, WRITE_TRUNCATE means:
Specifies that write should replace a table.
The replacement may occur in multiple steps - for instance by first removing the existing table, then creating a replacement, then filling it in. This is not an atomic operation, and external programs may see the table in any of these intermediate steps.
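As a minimal sketch of how that looks in a Beam Java pipeline (the table reference, the tableSchema variable, and the rows PCollection are placeholders for whatever your pipeline already produces, not your actual names):

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;

// rows is the PCollection<TableRow> produced by your Postgres query step.
rows.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")   // hypothetical table reference
        .withSchema(tableSchema)                // your existing TableSchema
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        // WRITE_TRUNCATE replaces the table's contents on every run, so the
        // daily job overwrites yesterday's rows instead of appending to them.
        .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));
```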
If your use case cannot afford the table being unavailable during the operation, a common pattern is to move the data to a secondary / staging table and then use an atomic operation on BigQuery to replace the original table (e.g., using CREATE OR REPLACE TABLE). See the sketch below.
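A rough sketch of that staging pattern, assuming the Dataflow job writes to a staging table and a small follow-up step (here using the BigQuery Java client library; all project, dataset, and table names are made up) performs the atomic swap:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class SwapStagingTable {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Atomically replace the serving table with the freshly loaded staging data,
    // so readers never see a partially written or missing table.
    String swapSql =
        "CREATE OR REPLACE TABLE `my_project.my_dataset.daily_rows` AS "
            + "SELECT * FROM `my_project.my_dataset.daily_rows_staging`";

    bigquery.query(QueryJobConfiguration.newBuilder(swapSql).build());
  }
}
```

You could run this step after the Dataflow job completes (for example from Cloud Scheduler or whatever orchestrates the daily run); the exact trigger is up to your setup.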
Upvotes: 3