TigSh
TigSh

Reputation: 645

BigQuery - DELETE statement to remove duplicates Group by Id

I am trying to use DELETE to remove duplicate records from my BigQuery table.

I found multiple solutions to the above question but most of them use CREATE, REPLACE or SELECT.

The closest solution using DELETE I found was:

BigQuery - DELETE statement to remove duplicates

BigQuery Standard SQL: Delete Duplicates from Table

I have a follow up questions on the below solution:

#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
    SELECT AS STRUCT id, MAX(loadTime) loadTime 
    FROM `yourproject.yourdataset.duplicates` where id= '123'
    GROUP BY id)

This statement deletes all the records from the table which does not satisfy the NOT IN condition. For example, if my table looks like below:

Id      Loadtime
123        5
123        4
456        2
321        1

The query above deletes all records except for the first row. How can I modify the query so that it deletes only the 2nd row i.e. it only deletes group by id?

The final output should be:

 Id      Loadtime
 123        5
 456        2
 321        1

Upvotes: 1

Views: 497

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173210

Below should work as per your expectation

#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
SELECT AS STRUCT id, MAX(loadTime) loadTime 
FROM `yourproject.yourdataset.duplicates` 
GROUP BY id)  

so, in your sample - it will delete ONLY second row

Id  Loadtime     
123 4    

Upvotes: 1

Related Questions