Handling duplicates in BigQuery (Nested Table)

Question

I think this is a very simple question, but I would like some guidance. I don't want to have to drop the table and load a new one with the deduplicated records. Is it possible to do something like a DELETE FROM in BigQuery, based on the query below? PS: This is a nested table!

SELECT
  *
FROM (
  SELECT
    *,
    ROW_NUMBER()
          OVER (PARTITION BY id, date_register) row_number
  FROM
    dataset.table)
WHERE
  row_number = 1
ORDER BY id, date_register

Yun Zhang · Accepted Answer

Update: please also check Felipe Hoffa's answer, which is simpler, and learn more in this post: BigQuery Deduplication.

You need to exclude row_number from output and overwrite your table using CREATE OR REPLACE TABLE:

-- PARTITION BY must come before AS in BigQuery DDL
CREATE OR REPLACE TABLE your_table
PARTITION BY DATE(date_register) AS
SELECT
  * EXCEPT(row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER()
          OVER (PARTITION BY id, date_register) row_number
  FROM your_table)
WHERE
  row_number = 1
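
Before overwriting the table, it can help to confirm that duplicates actually exist, and to rerun the same check afterwards. A minimal sketch, assuming the same your_table, id, and date_register names as above (the copies alias is only illustrative):

```sql
-- List every (id, date_register) key that appears more than once,
-- with the number of rows sharing that key.
SELECT
  id,
  date_register,
  COUNT(*) AS copies
FROM your_table
GROUP BY id, date_register
HAVING COUNT(*) > 1
ORDER BY copies DESC
```

If this returns no rows after the CREATE OR REPLACE TABLE statement has run, no duplicate keys remain.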

If the source table has no partition field defined, I recommend creating a new table with the partition field so that this query works and you can automate the process.
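
That one-time step might look like the sketch below. It assumes date_register is a TIMESTAMP or DATETIME column, and the name your_table_partitioned is hypothetical. As far as I know, BigQuery refuses to replace a table with one that has a different partitioning spec, which is why a separate table is used here instead of CREATE OR REPLACE on the original.

```sql
-- One-time copy of the unpartitioned source into a new table
-- partitioned by the date part of date_register.
-- your_table_partitioned is a hypothetical name.
CREATE TABLE your_table_partitioned
PARTITION BY DATE(date_register) AS
SELECT
  *
FROM your_table
```

Later runs of the deduplication query can then read from and overwrite your_table_partitioned with the same PARTITION BY clause.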
