Reputation: 579
I'm having hard time removing duplicates from database. It's MariaDB (protocol version: 10, 10.3.34-MariaDB Server). I need to remove rows where three columns are equal. I was trying to use WITH clause but database throws error that it can't recognize 'WITH', so I focused on traditional way.
I need to remove rows where foreignId
, column1
and column2
are equal.
I'm checking if there are duplicates like
SELECT foreignId, column1, column2, COUNT(*)
FROM table1
GROUP BY foreignId, column1, column2
HAVING COUNT(*) > 1
Trying to remove duplicates...
DELETE table1
FROM table1
INNER JOIN (
SELECT
p.id,
p.foreignId,
p.column1,
p.column2,
ROW_NUMBER() OVER (
PARTITION BY
p.column1,
p.column2,
p.foreignId
ORDER BY
p.foreignId,
p.column2,
p.column1
) AS row_number
FROM table1 p
GROUP BY p.foreignId, p.column1, p.column2
) dup
ON table1.column1 = dup.column1
WHERE dup.row_number > 1;
I was modifying this code alot but still can't make it work as intended... What am I doing wrong?
Upvotes: 0
Views: 259
Reputation: 5975
Assuming you have a unique column like id, you can do following:
DELETE FROM table1 WHERE ID NOT IN
(SELECT x.id FROM
(SELECT MAX(id) id, MAX(foreignId) foreignId,
MAX(column1) column1, MAX(column2) column2
FROM table1
WHERE ttimestamp IN (SELECT MAX(ttimestamp) FROM table1
GROUP BY foreignID, column1, column2)
GROUP BY foreignId, column1, column2)x);
Please see the working example here: db<>fiddle
Upvotes: 1
Reputation: 147176
You have a few issues with your query:
GROUP BY
in the subqueryORDER BY
in the OVER
clause to ORDER BY p.ts DESC
(where ts
is the name of your timestamp column)JOIN
on the unique id
column; otherwise you will delete any row which has values which have duplicates anywhere i.e. ON table.id = dup.id
That will give you:
DELETE table1
FROM table1
INNER JOIN (
SELECT
p.id,
ROW_NUMBER() OVER (
PARTITION BY
p.column1,
p.column2,
p.foreignId
ORDER BY
p.ttimestamp DESC
) AS rn
FROM table1 p
) dup
ON table1.id = dup.id
WHERE dup.rn > 1
Note I would not use row_number
as a column alias as it is a reserved word, so I've changed it to rn
above.
Demo (thanks to @JonasMetzler) on dbfiddle
Note that if it's possible for duplicate rows to also have the same timestamp value, this query will delete a random selection of those rows. If you want a deterministic result, change the ORDER BY
clause to
ORDER BY
p.ttimestamp DESC,
p.id DESC
which will keep the row with the highest (or lowest if you remove the DESC
after p.id
) id
value.
Upvotes: 2