Oksana Ok
Oksana Ok

Reputation: 555

How to display in Big Query ONLY duplicated records?

To view records without duplicated ones, I use this SQL

SELECT * EXCEPT(row_number)
FROM (SELECT*,ROW_NUMBER() OVER (PARTITION BY orderid) row_number
FROM `TABLE`)
WHERE row_number = 1

What is the best practice to display only duplicated records from a single table?

Upvotes: 11

Views: 35418

Answers (3)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

Below is for BigQuery Standard SQL

Me personally, I prefer not to rely on ROW_NUMBER() whenever it is possible because with big volume of data it tends to lead to Resource Exceeded error

So, from my experience I would recommend below options:

To view records for those orderid with only one entry:

#standardSQL
SELECT AS VALUE ANY_VALUE(t) 
FROM `project.dataset.table` t
GROUP BY orderid
HAVING COUNT(1) = 1  

to view records for those orderid with more than one entry:

#standardSQL
SELECT * EXCEPT(flag) FROM (
  SELECT *, COUNT(1) OVER(PARTITION BY orderid) > 1 flag
  FROM `project.dataset.table` 
) 
WHERE flag  

note: behind the hood - COUNT(1) OVER() can be calculated using as many workers as available while ROW_NUMBER() OVER() requires all respective data to be moved to one worker (thus Resource related issue)

OR

#standardSQL
SELECT *
FROM `project.dataset.table` 
WHERE orderid IN (
  SELECT orderid FROM `project.dataset.table` 
  GROUP BY orderid HAVING COUNT(1) > 1
  )

Upvotes: 8

Rafaël
Rafaël

Reputation: 1027

Why not just change the row_number ? You have partitionned by order id, creating partitions of duplicates, ranked the records and take only the first element to remove the duplicates. But if you take only the row_number = 2, you'll have only elements from partitions with at least 2 elements, i.e only duplicates.

SELECT * EXCEPT(row_number)
FROM (SELECT*,ROW_NUMBER() OVER (PARTITION BY orderid) row_number
FROM `TABLE`)
WHERE row_number = 2

Note :Use row_number = 2 will give you only 1 element of duplicates. If you go with row_number > 1, the result may contain duplicates again (for example if you had 3 identical elements in the first table).

Upvotes: 6

Alessandro
Alessandro

Reputation: 655

You can display the duplicated row by showing only raw with row_number greater than 1.

select
  * except(row_number)
from (
  select
    *, row_number() over (partition by ) as row_number
  from `TABLE`)
where row_number > 1

Upvotes: 4

Related Questions