Reputation: 720
I was wondering if there is a way to disallow duplicates in BigQuery.
Based on this article, I can deduplicate either a whole table or a single partition of a table.
To deduplicate a whole table:
CREATE OR REPLACE TABLE `transactions.testdata`
PARTITION BY date
AS SELECT DISTINCT * FROM `transactions.testdata`;
To deduplicate a table based on partitions defined in a WHERE
clause:
MERGE `transactions.testdata` t
USING (
SELECT DISTINCT *
FROM `transactions.testdata`
WHERE date=CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND date=CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
If there is no way to disallow duplicates, is this a reasonable approach to deduplicating a table?
Upvotes: 1
Views: 1481
Reputation: 4085
BigQuery doesn't have a mechanism like the constraints found in traditional DBMSs. In other words, you can't set a primary key or anything like that, because BigQuery is focused not on transactions but on fast analysis and scalability. You should think of it as a data lake rather than a database that enforces uniqueness.
If you have an existing table and need to deduplicate it, the approaches you mentioned will work. If you need your table to contain only unique rows and want to insert data programmatically without resorting to external resources, I can suggest a workaround:
First, insert the incoming data into a temporary table instead of writing it directly into your actual table.
Then, run a query against the temporary table and save the distinct results into your actual table. This step can be done programmatically in a few different ways, for example:
bq query --use_legacy_sql=false --destination_table=<dataset.actual_table> 'select distinct * from <dataset.temporary_table>'
This queries the distinct values in your temporary table and loads the results into the table indicated by the --destination_table flag. It's worth mentioning that this approach also works for partitioned tables.
Finally, drop the temporary table. Like the previous step, this can be done either with a scheduled query or with the bq command; a full sketch of the workflow follows.
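Putting the steps together, here is a minimal end-to-end sketch using the bq command line tool. The dataset and table names (mydataset.temporary_table, mydataset.actual_table) and the CSV source file are placeholders, so adjust them to your own setup:
# 1. Stage the incoming data in a temporary table instead of the actual table.
bq load --autodetect --source_format=CSV mydataset.temporary_table ./new_rows.csv
# 2. Write only the distinct rows into the actual table.
#    Use --append_table to add to an existing table, or --replace to overwrite it.
bq query --use_legacy_sql=false \
  --destination_table=mydataset.actual_table \
  --append_table \
  'SELECT DISTINCT * FROM `mydataset.temporary_table`'
# 3. Drop the temporary table.
bq rm -f -t mydataset.temporary_table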
I hope it helps
Upvotes: 3