BigQuery - DELETE statement to remove duplicates Group by Id

Question

I am trying to use DELETE to remove duplicate records from my BigQuery table.

I found multiple solutions to the above question but most of them use CREATE, REPLACE or SELECT.

The closest solution using DELETE I found was:

BigQuery - DELETE statement to remove duplicates

BigQuery Standard SQL: Delete Duplicates from Table

I have a follow up questions on the below solution:

#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
    SELECT AS STRUCT id, MAX(loadTime) loadTime 
    FROM `yourproject.yourdataset.duplicates` where id= '123'
    GROUP BY id)

This statement deletes all the records from the table which does not satisfy the NOT IN condition. For example, if my table looks like below:

Id      Loadtime
123        5
123        4
456        2
321        1

The query above deletes all records except for the first row. How can I modify the query so that it deletes only the 2nd row i.e. it only deletes group by id?

The final output should be:

 Id      Loadtime
 123        5
 456        2
 321        1

Mikhail Berlyant · Accepted Answer

Below should work as per your expectation

#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
SELECT AS STRUCT id, MAX(loadTime) loadTime 
FROM `yourproject.yourdataset.duplicates` 
GROUP BY id)

so, in your sample - it will delete ONLY second row

Id  Loadtime     
123 4

BigQuery - DELETE statement to remove duplicates Group by Id

Answers (1)

Related Questions