Reputation: 1388
What would be the best way to remove duplicates while merging their records into one?
I have a situation where the table keeps track of player names and their records like this:
stats
-------------------------------
nick totalgames wins ...
John 100 40
john 200 97
Whistle 50 47
wHiStLe 75 72
...
I would need to merge the rows where nick is duplicated (when ignoring case) and merge the records into one, like this:
stats
-------------------------------
nick totalgames wins ...
john 300 137
whistle 125 119
...
I'm doing this in Postgres. What would be the best way to do this?
I know that I can get the names where duplicates exist by doing this:
select lower(nick) as nick, totalgames, count(*)
from stats
group by lower(nick), totalgames
having count(*) > 1;
I thought of something like this:
update stats
set totalgames = totalgames + s.totalgames
from (that query up there) s
where lower(nick) = s.nick
Except this doesn't work properly. And I still can't seem to be able to delete the other duplicate rows containing the duplicate names. What can I do? Any suggestions?
Upvotes: 10
Views: 6320
Reputation: 117380
I think easiest way to do it in one query would be using common table expressions:
with cte as (
delete from stats
where lower(nick) in (
select lower(nick) from stats group by lower(nick) having count(*) > 1
)
returning *
)
insert into stats(nick, totalgames, wins)
select lower(nick), sum(totalgames), sum(wins)
from cte
group by lower(nick);
As you see, inside the cte I'm deleting duplicates and returning deleted rows, after that inserting grouped deleted data back into table.
see sql fiddle demo
Upvotes: 4
Reputation: 44250
It can all be done in one statement, using RETURNING
.
-- The data
CREATE TABLE stats
( nick VARCHAR PRIMARY KEY
, totalgames INTEGER NOT NULL DEFAULT 0
, wins INTEGER NOT NULL DEFAULT 0
);
INSERT INTO stats(nick, totalgames,wins) VALUES
( 'John', 100, 40) ,( 'john', 200, 97)
,( 'Whistle', 50, 47) ,( 'wHiStLe', 75, 72)
, ( 'Single', 42, 13 ) -- this person has only one record
;
SELECT * FROM stats;
-- The query:
WITH upd AS (
UPDATE stats dst
SET totalgames = src.totalgames
, wins = src.wins
FROM ( SELECT MIN(nick) AS nick -- pick the "lowest" nick as the canonical nick
, SUM(totalgames) AS totalgames
, SUM(wins) AS wins
FROM stats
GROUP BY lower(nick)
) src
WHERE dst.nick = src.nick
RETURNING dst.nick -- only the records that have been updated
)
-- Delete the records that were NOT updated.
DELETE FROM stats del
WHERE NOT EXISTS (
SELECT * FROM upd
WHERE upd.nick = del.nick
)
;
SELECT * FROM stats;
Output:
INSERT 0 5
nick | totalgames | wins
---------+------------+------
John | 100 | 40
john | 200 | 97
Whistle | 50 | 47
wHiStLe | 75 | 72
Single | 42 | 13
(5 rows)
DELETE 2
nick | totalgames | wins
---------+------------+------
wHiStLe | 125 | 119
john | 300 | 137
Single | 42 | 13
(3 rows)
Upvotes: 3
Reputation: 143
UPDATE stats SET totalgames=s.totalgames, wins=s.wins
FROM (SELECT lower(nick) AS nick,SUM(totalgames) AS totalgames,SUM(wins) AS wins FROM stats
GROUP BY lower(nick))s WHERE lower(nick)=s.nick;
DELETE FROM stats WHERE
lower(nick) IN (SELECT lower(nick) FROM stats GROUP BY lower(nick) HAVING COUNT(*)>1)
AND NOT lower(nick) IN (SELECT first(nick) FROM stats GROUP BY lower(nick)
should work.
Upvotes: 0
Reputation: 2369
Here is your update:
UPDATE stats
SET totalgames = x.games, wins = x.wins
FROM (SELECT LOWER(nick) AS nick, SUM(totalgames) AS games, SUM(wins) AS wins
FROM stats
GROUP BY LOWER(nick) ) AS x
WHERE LOWER(stats.nick) = x.nick;
Here is the delete to blow away the duplicate rows:
DELETE FROM stats USING stats s2
WHERE lower(stats.nick) = lower(s2.nick) AND stats.nick < s2.nick;
(Note that the 'update...from' and 'delete...using' syntax are Postgres-specific, and were stolen shamelessly from this answer and this answer.)
You'll probably also want to run this to downcase all the names:
UPDATE STATS SET nick = lower(nick);
Aaaand throw in a unique index on the lowercase version of 'nick' (or add a constraint to that column to disallow non-lowercase values):
CREATE UNIQUE INDEX ON stats (LOWER(nick));
Upvotes: 10