Kalle

Reputation: 452

Find duplicate records in large table on multiple columns the right way

I have read many threads on this subject and tried a few things, but nothing has worked as I hoped. I need some clarification, and I apologize if this is considered a duplicate thread.

A client of mine hosts a Postgres database where one table holds a little more than 12 million records. They have tasked me with finding duplicate records, extracting them for viewing, and, if everything looks OK, deleting the duplicates.

My main concern has been performance on the server. Running DISTINCT queries over 12 million records must consume a lot of resources, right?

Since my first task is to extract the records for viewing in, say, a CSV file, rather than simply deleting them, my approach in pgAdmin was to execute this query to a file.

SELECT *
FROM my_table
WHERE my_table_id NOT IN (
    SELECT DISTINCT ON (
            num_1,
            num_2,
            num_3,
            num_4,
            num_5,
            my_date
        )
        my_table_id
    FROM my_table
);

However, this query takes way too long; I halted it after 20 minutes of execution time. To make things more complex, my client is reluctant to let me clone a local copy of the table because of strict security requirements. They prefer that it all be done on the live hosting environment.

The table definition is quite simple. It looks like this:

CREATE TABLE my_table
(
    my_table_id bigserial NOT NULL,
    num_1 bigserial NOT NULL,
    num_2 bigserial NOT NULL,
    num_3 bigserial NOT NULL,
    num_4 numeric,
    num_5 integer,
    my_date date,
    my_text character varying
)

The primary key "my_table_id" has not been compromised and is always unique. The column "my_text" is not relevant to the query, since it will be empty for all duplicates. Only the numeric fields and the date need matching: all columns except my_table_id and my_text must match across records for them to qualify as duplicates.

What is the best way to solve this? Is there a server-friendly way that won't eat all resources on the host environment? Please help me understand the best approach!

Thank you!

Upvotes: 4

Views: 11986

Answers (3)

Rajesh

Reputation: 700

# An easy and simple way to check for duplicate records

SELECT COL1, COL2, COL3, COL4, COUNT(*) AS DUP_CNT
FROM TABLE_NAME
GROUP BY COL1, COL2, COL3, COL4
HAVING COUNT(*) > 1;

If this query returns rows, you have duplicate records; otherwise you don't.
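Applied to the table in the question, with the six columns that have to match, the same pattern would be something like this:

-- one row per duplicated combination, with the number of copies
SELECT num_1, num_2, num_3, num_4, num_5, my_date, COUNT(*) AS dup_cnt
FROM my_table
GROUP BY num_1, num_2, num_3, num_4, num_5, my_date
HAVING COUNT(*) > 1;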

# Drop Constraints

alter table TABLE_NAME drop constraint PK_CONSTRAINTS_NAME;

# Add Constraints

alter table TABLE_NAME add constraint PK_CONSTRAINTS_NAME primary key(COL1,COL2,COL3,COL4);
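Note that in the table from the question my_table_id should presumably stay the primary key, and num_4, num_5 and my_date are nullable, so they could not be part of a primary key anyway. An alternative sketch, once the existing duplicates are removed, is a unique constraint on the matching columns (the constraint name here is just an example; PostgreSQL treats NULLs as distinct within a unique constraint):

-- prevent new duplicates across the six matching columns
ALTER TABLE my_table
    ADD CONSTRAINT my_table_six_cols_uniq
    UNIQUE (num_1, num_2, num_3, num_4, num_5, my_date);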

Upvotes: 0

FuzzyTree

Reputation: 32392

Another way, using analytic (window) functions:

select * from (
    select *,
    count(*) over (partition by num_1, num_2, num_3, num_4, num_5, my_date) cnt
    from my_table
) t1 where cnt > 1
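If, after reviewing the output, you also want to delete the extra copies while keeping one row per group, a sketch along the same lines could look like this (keeping the row with the lowest my_table_id is an assumption about which copy to keep):

-- delete every row except the one with the lowest my_table_id in each duplicate group
DELETE FROM my_table
WHERE my_table_id IN (
    SELECT my_table_id
    FROM (
        SELECT my_table_id,
               row_number() OVER (
                   PARTITION BY num_1, num_2, num_3, num_4, num_5, my_date
                   ORDER BY my_table_id
               ) AS rn
        FROM my_table
    ) x
    WHERE rn > 1
);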

Upvotes: 5

radar

Reputation: 13425

You need to use GROUP BY and HAVING to get the duplicate records, instead of DISTINCT.

The subquery finds every combination of values that occurs more than once; joining back to the table returns all the duplicate rows:

SELECT *
FROM my_table mt
JOIN (
    SELECT
        num_1,
        num_2,
        num_3,
        num_4,
        num_5,
        my_date
    FROM my_table
    GROUP BY num_1, num_2, num_3, num_4, num_5, my_date
    HAVING COUNT(*) > 1
) t
    ON  mt.num_1 = t.num_1
    AND mt.num_2 = t.num_2
    AND mt.num_3 = t.num_3
    AND mt.num_4 = t.num_4
    AND mt.num_5 = t.num_5
    AND mt.my_date = t.my_date
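Since the question asks to first extract the duplicates to a CSV for review, the result of a query like this can be written to a client-side file with psql's \copy meta-command, which has to be given on a single line (the file name is just an example):

\copy (SELECT mt.* FROM my_table mt JOIN (SELECT num_1, num_2, num_3, num_4, num_5, my_date FROM my_table GROUP BY num_1, num_2, num_3, num_4, num_5, my_date HAVING COUNT(*) > 1) t ON mt.num_1 = t.num_1 AND mt.num_2 = t.num_2 AND mt.num_3 = t.num_3 AND mt.num_4 = t.num_4 AND mt.num_5 = t.num_5 AND mt.my_date = t.my_date) TO 'duplicates.csv' WITH CSV HEADER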

Upvotes: 6
