Reputation: 720
I was wondering if there is a way to disallow duplicates in BigQuery.
Based on this article, I can deduplicate either a whole table or a single partition of a table.
To deduplicate a whole table:
CREATE OR REPLACE TABLE `transactions.testdata`
PARTITION BY date
AS SELECT DISTINCT * FROM `transactions.testdata`;
To deduplicate a table based on partitions defined in a WHERE
clause:
MERGE `transactions.testdata` t
USING (
SELECT DISTINCT *
FROM `transactions.testdata`
WHERE date=CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND date=CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
If there is no way to disallow duplicates, is this a reasonable approach to deduplicating a table?
Upvotes: 1
Views: 1481
Reputation: 4085
BigQuery doesn't have a mechanism like the constraints found in traditional DBMSs. In other words, you can't set a primary key or anything like that, because BigQuery is focused not on transactions but on fast analysis and scalability. You should think of it as a data lake rather than a database that enforces uniqueness.
If you have an existing table and need to deduplicate it, the approaches you mentioned will work. If you need your table to contain only unique rows and want to insert data programmatically without resorting to external resources, I can suggest a workaround:
First, insert the incoming data into a temporary table instead of writing it directly into your actual table.
Then, run a query against the temporary table and save the distinct results into your actual table. This step can be done programmatically in a few different ways, for example:
bq query --use_legacy_sql=false --destination_table=<dataset.actual_table> 'select distinct * from <dataset.temporary_table>'
This queries the distinct values in your temporary table and loads the results into the table indicated by the --destination_table flag. It's worth mentioning that this approach also works for partitioned tables.
Finally, drop the temporary table. Like the previous step, this can be done either with a scheduled query or with the bq command; a full sketch of the workflow follows.
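Putting the steps together, here is a minimal end-to-end sketch using the bq command line tool. The dataset and table names (mydataset.temporary_table, mydataset.actual_table) and the CSV source file are placeholders, so adjust them to your own setup:
# 1. Stage the incoming data in a temporary table instead of the actual table.
bq load --autodetect --source_format=CSV mydataset.temporary_table ./new_rows.csv
# 2. Write only the distinct rows into the actual table.
#    Use --append_table to add to an existing table, or --replace to overwrite it.
bq query --use_legacy_sql=false \
  --destination_table=mydataset.actual_table \
  --append_table \
  'SELECT DISTINCT * FROM `mydataset.temporary_table`'
# 3. Drop the temporary table.
bq rm -f -t mydataset.temporary_table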
I hope it helps
Upvotes: 3