Felipe FB
Felipe FB

Reputation: 1342

Handling duplicates in BigQuery (Nested Table)

I think this is a very simple question but I would like some guidance: I didn't want to have to drop a table to send a new table with the deduplicated records, like using DELETE FROM based on the query below using BigQuery, is it possible? PS: This is a nested table!

SELECT
  *
FROM (
  SELECT
    *,
    ROW_NUMBER()
          OVER (PARTITION BY id, date_register) row_number
  FROM
    dataset.table)
WHERE
  row_number = 1 
 order by id, date_register

Upvotes: 0

Views: 659

Answers (2)

Yun Zhang
Yun Zhang

Reputation: 5503

Update: please also check Felipe Hoffa's answer which is simpler, and learn more on this post: BigQuery Deduplication.


You need to exclude row_number from output and overwrite your table using CREATE OR REPLACE TABLE:

CREATE OR REPLACE TABLE your_table AS
PARTITION BY DATE(date_register) 
SELECT
  * EXCEPT(row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER()
          OVER (PARTITION BY id, date_register) row_number
  FROM your_table)
WHERE
  row_number = 1 

If you don´t have a partition field defined at the source, I recommend that you create a new table with the partition field to make this query work so that you can automate the process.

Upvotes: 1

Felipe Hoffa
Felipe Hoffa

Reputation: 59175

To de-duplicate in place, without re-creating the table - use MERGE:

MERGE `temp.many_random` t
USING (
  SELECT DISTINCT *
  FROM `temp.many_random`
)
ON FALSE
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW

It's simpler than the current accepted answer, as it won't ask you to match the current partitioning or clustering - it will just respect it.

Upvotes: 2

Related Questions