Reputation: 129

What is the faster way to calculate number of duplicate rows present in Redshift Table

There are millions of record in table. And need to calculate number of duplicate rows present in my table in Redshift. I could achieve it by using below query,

select 
    sum(cnt) from (select <primary_key>
    , count(*)-1 as cnt 
from 
    table_name 
group by 
    <primary_key> having count(*)>1

Is there a faster way to achieve the same ?
Is there a way do achieve this in a single query without using subquery ?

Thanks.

Upvotes: 3

Answers (2)

AlexYes

Reputation: 4208

If the criteria of duplication is only repeating primary key then

SELECT count(1)-count(distinct <primary_key>) FROM your_table

would work, except if you have specified your column as primary key in Redshift (it doesn't enforce constraint but if you mark a column as primary key count(distinct <primary_key>) will return the same as count(1) even if there are duplicate values in this column

Upvotes: 2

kazzi

Reputation: 534

You can try the following query:

SELECT Column_name, COUNT(*) Count_Duplicate
FROM Table_name
 GROUP BY Column_name
 HAVING COUNT(*) > 1
 ORDER BY COUNT(*) DESC

Upvotes: 4

What is the faster way to calculate number of duplicate rows present in Redshift Table

Answers (2)

Related Questions