user9203730
user9203730

Reputation:

Remove duplicates from table in bigquery

I found duplicates in my table by doing below query.

SELECT name, id, count(1) as count
  FROM [myproject:dev.sample] 
  group by name, id 
  having count(1) > 1

Now i would like to remove these duplicates based on id and name by using DML statement but its showing '0 rows affected' message. Am i missing something?

DELETE FROM PRD.GPBP WHERE
    id not in(select id from [myproject:dev.sample] GROUP BY id) and 
    name not in (select name from [myproject:dev.sample] GROUP BY name) 

Upvotes: 3

Views: 9591

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Below query (BigQuery Standard SQL) should be more optimal for de-duping like in your case

#standardSQL
SELECT AS VALUE ANY_VALUE(t)
FROM `myproject.dev.sample` AS t
GROUP BY name, id

If you run it from within UI - you can just set Write Preference to Overwrite Table and you are done

Or if you want you can use DML's INSERT to new table and then copy over original one

Meantime, the easiest way is as below (using DDL)

#standardSQL
CREATE OR REPLACE TABLE `myproject.dev.sample` AS
SELECT * FROM (
  SELECT AS VALUE ANY_VALUE(t)
  FROM `myproject.dev.sample` AS t
  GROUP BY name, id
)

Upvotes: 0

Mayank Porwal
Mayank Porwal

Reputation: 34056

I suggest, you create a new table without the duplicates. Drop your original table and rename the new table to original table.

You can find duplicates like below:

Create table new_table as 
Select name, id, ...... , put our remaining 10 cols here
FROM(
SELECT *, 
ROW_NUMBER() OVER(Partition by name , id Order by id) as rnk
FROM [myproject:dev.sample] 
)a
WHERE rnk = 1;

Then drop the older table and rename new_table with old table name.

Upvotes: 7

Related Questions