Reputation: 2876
How should I import data into BigQuery on a daily basis when I have potentially duplicated rows?
Here is a bit of context. I'm updating data on a daily basis from a spreadsheet to BigQuery, using Google Apps Script with a simple WRITE_APPEND method.
Sometimes I import data I've already imported the day before, so I'm wondering how I can avoid this.
Can I build a SQL query to clean my table of duplicate rows every day? Or is it possible to detect duplicates even before importing them (with some specific command in my job definition, for example)?
Thanks!
Upvotes: 5
Views: 3430
Reputation: 3077
If you have a large partitioned table and only want to remove duplicates in a given range, without scanning (cost-saving) and replacing the whole table, use the MERGE SQL below:
-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partitioned table, using surrogate_key as unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `your_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
SELECT k.*
FROM (
SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
FROM `your_project`.`data_set`.`the_table` AS original_data
WHERE stamp BETWEEN dt_start AND dt_end
GROUP BY surrogate_key
)
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partition range
THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
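Here, ON FALSE guarantees that no source row ever matches a destination row: the WHEN NOT MATCHED BY SOURCE clause deletes every existing row in the timestamp range, and the WHEN NOT MATCHED clause re-inserts exactly one row per surrogate_key, so the range is deduplicated in a single atomic statement.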
credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
Upvotes: 0
Reputation: 1714
As far as I know, the answer provided by Felipe Hoffa is the most effective way to avoid duplicate rows, since BigQuery does not normalize data when loading it. The reason is that BigQuery performs best with denormalized data [1]. To better understand this, I'd recommend you have a look at this SO thread.
I would also like to suggest using a SQL aggregate or analytic function to clean the duplicate rows in a BigQuery table, as in Felipe Hoffa's or Jordan Tigani's answers to this SO question.
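As a rough sketch of that analytic-function cleanup (reusing the hypothetical table and surrogate_key column from the answer above; adapt the names to your schema):
CREATE OR REPLACE TABLE `your_project`.`data_set`.`the_table` AS
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    -- keep exactly one row per surrogate_key; ORDER BY decides which copy survives
    ROW_NUMBER() OVER (PARTITION BY surrogate_key ORDER BY stamp DESC) AS rn
  FROM `your_project`.`data_set`.`the_table`
)
WHERE rn = 1
Note that this rewrites the whole table (and a plain CREATE OR REPLACE TABLE drops any partitioning spec unless you redeclare it), which is exactly the full scan the MERGE answer above avoids.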
Upvotes: 3
Reputation: 59165
(requires #standardSql)
#standardSQL
INSERT INTO `fh-bigquery.tt.test_import_native` (id, data)
SELECT *
FROM `fh-bigquery.tt.test_import_sheet`
WHERE id NOT IN (
SELECT id
FROM `fh-bigquery.tt.test_import_native`
)
The WHERE id NOT IN (...) clause ensures that only rows with new ids are loaded into the table.
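One caveat: if the target table can contain NULL ids, NOT IN evaluates to NULL for every row and nothing gets inserted. A sketch of the same idea using NOT EXISTS (same tables as above) avoids that edge case:
#standardSQL
INSERT INTO `fh-bigquery.tt.test_import_native` (id, data)
SELECT sheet.*
FROM `fh-bigquery.tt.test_import_sheet` AS sheet
WHERE NOT EXISTS (
  -- correlated subquery: true only when the id already exists in the target
  SELECT 1
  FROM `fh-bigquery.tt.test_import_native` AS native
  WHERE native.id = sheet.id
)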
Upvotes: 5