kloop
kloop

Reputation: 4721

how do I remove duplicates from a database?

I have a table with four fields: ID auto increment, a string, and two integers. I want to do something of the sort:

     select count(*) from table group by string

and then use the result to consolidate all counts which are larger than 1.

That is, take all rows which have count larger than 1, and replace all of these rows in the database (which have the same string) with a single row, ID does not matter, and the two integers are the sum over all rows of all of the rows with count larger than 1.

Is that possible using a few simple queries?

Thanks.

Upvotes: 1

Views: 157

Answers (6)

Larry Lustig
Larry Lustig

Reputation: 50970

You can derive this information in a VIEW:

 CREATE VIEW SummarizedData (StringCol, IntCol1, IntCol2, OriginalRowCount) AS
    SELECT StringCol, SUM(IntCol1), SUM(IntCol2), COUNT(*)
    FROM TableName
    GROUP BY StringCol

This will create a virtual table with the information you want. It will include the rows for which there was only one instance of StringCol values as well — if you really don't want those add the phrase HAVING COUNT(*) > 1 to the end of the query.

With this method you can maintain the original table and just read from the summarized data or you can create an empty table structure with the appropriate columns and INSERT from SummarizedData into your new table to get a "real" table with the data.

Upvotes: 0

Brent Baisley
Brent Baisley

Reputation: 12721

You can do it all in a two queries, no temp tables. But you need to run the DELETE query repeatedly since it will only delete 1 duplicate at a time. So if there are 3 copies of a row, you would need to run it twice. But you can just run it until there are no more results.

Update the duplicate rows you are going to keep to contain the count/sum.

UPDATE tablename JOIN (
   SELECT min(id) id,sum(int1) int1,sum(int2) int2 
   FROM tablename GROUP BY string HAVING c>1
) AS dups ON tablename.id=dups.id
SET tablename.int1=dups.int1, tablename.int2

Then you can use that same SELECT query in a DELETE query, using the multiple-table syntax.

DELETE tablename FROM tablename 
JOIN (SELECT max(id) AS id,count(*) c FROM tablename GROUP BY string HAVING c>1) dups
ON tablename.id=dups.id

Just run that DELETE until there are no rows returned (0 affected rows).

Upvotes: 0

John Pick
John Pick

Reputation: 5650

If you can stop the table from being updated by other users, then it's pretty easy.

-- We're going to add records before deleting old ones, so keep track of which records are old.
DECLARE @OldMaxID INT
SELECT @OldMaxID = MAX(ID) FROM table

-- Combine duplicate records into new records
INSERT table (string, int1, int2)
SELECT string, SUM(int1), SUM(int2)
FROM table
GROUP BY string
HAVING COUNT(*) > 1

-- Delete records that were used to make combined records.
DELETE FROM table
WHERE ID <= @OldMaxID
GROUP BY string
HAVING COUNT(*) > 1

Upvotes: 0

Nikola Markovinović
Nikola Markovinović

Reputation: 19356

I would suggest to insert into temporary table data grouped by string AND accompanied by min(id) where there are duplicates. Then update original table with sums where id = min(id), and delete where strings match but ids don't.

 insert into temp
 select string, min(id) id, sum(int1) int1, sum(int2) int2
   from table
  group by string
 having count(*) > 1

 update table, temp
   set table.int1 = temp.int1,
       table.int2 = temp.int2
 where table.id = temp.id
-- Works because there is only one record given a string in temp
 delete table
  where exists (select null from temp where temp.string = table.string and temp.id <> table.id)

Backup is mandatory :-) and a transaction also.

Upvotes: 1

Matt Ball
Matt Ball

Reputation: 359826

Start by selecting just the ones with count > 0, and selecting the sums that you want:

select * from (
    select count(*), string_col, sum(int_col_1), sum(int_col_2)
    from my_table
    group by string_col
) as foo where count > 1

After that, I would put that data into a temporary table, delete the rows you don't want, and insert the data from the temp table into the original one.

Upvotes: 0

Martin.
Martin.

Reputation: 10529

There's a simple way to do this. Just place something like

id NOT IN (select id from table group by string)

in your where statement, which will select only duplicates

Upvotes: 0

Related Questions