Simon Breton

Reputation: 2876

How can I avoid and/or clean duplicated rows in BigQuery?

How should I import data into BigQuery on a daily basis when I have potentially duplicated rows?

Here is a bit of context. I'm updating data on a daily basis from a spreadsheet to BigQuery, using Google Apps Script with a simple WRITE_APPEND load.

Sometimes I'm importing data I've already imported the day before. So I'm wondering: how can I avoid this?

Can I build a SQL query to clean my table of duplicate rows every day? Or is it possible to detect duplicates even before importing them (with some specific command in my job definition, for example...)?

Thanks!

Upvotes: 5

Views: 3430

Answers (3)

Hui Zheng

Reputation: 3077

If you have a large partitioned table and only want to remove duplicates within a given range, without scanning (cost saving) or replacing the whole table, use the MERGE SQL below:

-- WARNING: back up the table before this operation
-- For a large timestamp-partitioned table
-- -------------------------------------------
-- De-duplicate rows within a given range of a partitioned table,
-- using surrogate_key as the unique id
-- -------------------------------------------

DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles");
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");

MERGE INTO `your_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
  SELECT k.*
  FROM (
    -- keep one arbitrary row per surrogate_key within the range
    SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
    FROM `your_project`.`data_set`.`the_table` AS original_data
    WHERE stamp BETWEEN dt_start AND dt_end
    GROUP BY surrogate_key
  )
) AS INTERNAL_SOURCE
ON FALSE -- never match, so every row falls into one of the branches below

WHEN NOT MATCHED BY SOURCE
  AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in the partition range
    THEN DELETE

WHEN NOT MATCHED THEN INSERT ROW -- re-insert the de-duplicated rows

credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a

Upvotes: 0

enle lin

Reputation: 1714

As far as I know, the answer provided by Felipe Hoffa is the most effective way to avoid duplicate rows, since BigQuery does not normalize data when loading it. The reason is that BigQuery performs best with denormalized data [1]. To better understand this, I'd recommend you have a look at this SO thread.

I would also suggest using a SQL aggregate or analytic function to clean the duplicate rows in an existing BigQuery table, as in Felipe Hoffa's or Jordan Tigani's answer to this SO question; a sketch using ROW_NUMBER() follows below.
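
For illustration, here is a minimal sketch of the analytic-function approach, assuming a table `your_project.data_set.the_table` where a column id identifies duplicates (the table name, id, and rn are placeholders, not from the answers above): number the rows within each id and keep only the first.

#standardSQL
-- Keep one row per id; ROW_NUMBER() marks the duplicates
SELECT
  * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id) AS rn
  FROM `your_project.data_set.the_table`
)
WHERE rn = 1

One common way to apply it is to run the query with the same table as the destination and WRITE_TRUNCATE as the write disposition, which replaces the table with its de-duplicated contents in a single job.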

Upvotes: 3

Felipe Hoffa

Reputation: 59165

  • Step 1: Have a sheet with data to be imported


  • Step 2: Set up your spreadsheet as a federated data source in BigQuery.


  • Step 3: Use DML to load data into an existing table

(requires #standardSQL)

#standardSQL
INSERT INTO `fh-bigquery.tt.test_import_native` (id, data)
SELECT *
FROM `fh-bigquery.tt.test_import_sheet` 
WHERE id NOT IN (
  SELECT id 
  FROM `fh-bigquery.tt.test_import_native`
)

WHERE id NOT IN (...) ensures that only rows with new ids are loaded into the table.
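
One caveat worth noting: if id is ever NULL in the destination table, the NOT IN subquery makes the whole filter evaluate to NULL and no new rows are inserted. A sketch of the same anti-join written with NOT EXISTS (reusing the table and column names from the answer above, and assuming the sheet has exactly the columns id and data) sidesteps that:

#standardSQL
-- Same insert-only-new-ids logic, written as NOT EXISTS so NULL ids
-- in the destination table cannot silently block the whole insert
INSERT INTO `fh-bigquery.tt.test_import_native` (id, data)
SELECT s.id, s.data
FROM `fh-bigquery.tt.test_import_sheet` AS s
WHERE NOT EXISTS (
  SELECT 1
  FROM `fh-bigquery.tt.test_import_native` AS n
  WHERE n.id = s.id
)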

Upvotes: 5
