Reputation: 555
To view records without duplicated ones, I use this SQL
SELECT * EXCEPT(row_number)
FROM (SELECT*,ROW_NUMBER() OVER (PARTITION BY orderid) row_number
FROM `TABLE`)
WHERE row_number = 1
What is the best practice to display only duplicated records from a single table?
Upvotes: 11
Views: 35418
Reputation: 172944
Below is for BigQuery Standard SQL
Me personally, I prefer not to rely on ROW_NUMBER() whenever it is possible because with big volume of data it tends to lead to Resource Exceeded error
So, from my experience I would recommend below options:
To view records for those orderid with only one entry:
#standardSQL
SELECT AS VALUE ANY_VALUE(t)
FROM `project.dataset.table` t
GROUP BY orderid
HAVING COUNT(1) = 1
to view records for those orderid with more than one entry:
#standardSQL
SELECT * EXCEPT(flag) FROM (
SELECT *, COUNT(1) OVER(PARTITION BY orderid) > 1 flag
FROM `project.dataset.table`
)
WHERE flag
note: behind the hood - COUNT(1) OVER() can be calculated using as many workers as available while ROW_NUMBER() OVER() requires all respective data to be moved to one worker (thus Resource related issue)
OR
#standardSQL
SELECT *
FROM `project.dataset.table`
WHERE orderid IN (
SELECT orderid FROM `project.dataset.table`
GROUP BY orderid HAVING COUNT(1) > 1
)
Upvotes: 8
Reputation: 1027
Why not just change the row_number
? You have partitionned by order id
, creating partitions of duplicates, ranked the records and take only the first element to remove the duplicates. But if you take only the row_number = 2
, you'll have only elements from partitions with at least 2 elements, i.e only duplicates.
SELECT * EXCEPT(row_number)
FROM (SELECT*,ROW_NUMBER() OVER (PARTITION BY orderid) row_number
FROM `TABLE`)
WHERE row_number = 2
Note :Use row_number = 2
will give you only 1 element of duplicates. If you go with row_number > 1
, the result may contain duplicates again (for example if you had 3 identical elements in the first table).
Upvotes: 6
Reputation: 655
You can display the duplicated row by showing only raw with row_number
greater than 1.
select
* except(row_number)
from (
select
*, row_number() over (partition by ) as row_number
from `TABLE`)
where row_number > 1
Upvotes: 4