Reputation:
This is similar to this question, but it seems like some of the answers there aren't quite compatible with MySQL (or I'm not doing it right), and I'm having a heck of a time figuring out the changes I need. Apparently my SQL is rustier than I thought it was. I'm also looking to change a column value rather than delete, but I think at least that part is simple...
I have a table like:
rowid SERIAL
fingerprint TEXT
duplicate BOOLEAN
contents TEXT
created_date DATETIME
I want to set duplicate=true for all but the first (by created_date) of each group by fingerprint. It's easy to mark all of the rows with duplicate fingerprints as dupes. The part I'm getting stuck on is keeping the first.
One of the apps that populates the table does bulk loads of data, with multiple workers loading data from different sources, and the workers' data isn't necessarily partitioned by date, so it's a pain to try to mark these all as they come in (the first one inserted isn't necessarily the first one by date). Also, I already have a bunch of data in there I'll need to clean up either way. So I'd rather just have a relatively efficient query I can run after a bulk load to clean up than try to build it into that app.
Thanks!
Upvotes: 2
Views: 1473
Reputation: 753625
Untested...
UPDATE TheAnonymousTable
   SET duplicate = TRUE
 WHERE rowid NOT IN
       (SELECT rowid
          FROM (SELECT MIN(created_date) AS created_date, fingerprint
                  FROM TheAnonymousTable
                 GROUP BY fingerprint
               ) AS M,
               TheAnonymousTable AS T
         WHERE M.created_date = T.created_date
           AND M.fingerprint = T.fingerprint
       );
The logic is that the innermost query returns the earliest created_date for each distinct fingerprint as table alias M. The middle query determines the rowid value for each of those rows; it is a nuisance to have to do this (but necessary), and the code assumes that you won't get two records for the same fingerprint and timestamp. This gives you the rowid for the earliest record for each separate fingerprint. Then the outer query (the UPDATE) sets the 'duplicate' flag on all those rows where the rowid is not one of the earliest rows.
Some DBMS may be unhappy about doing (nested) sub-queries on the table being updated.
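To make the three levels of that query concrete, here is a minimal sketch that runs the same statement against an in-memory SQLite database from Python. SQLite is only a stand-in so the example runs without a server (it does not reproduce MySQL's restrictions on subqueries against the table being updated), and the sample rows are hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL here; the sample data is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE TheAnonymousTable (
           rowid INTEGER PRIMARY KEY,
           fingerprint TEXT,
           duplicate BOOLEAN DEFAULT 0,
           contents TEXT,
           created_date DATETIME)"""
)
# Row 2 is the earliest 'aaa' record even though it was inserted second.
conn.executemany(
    "INSERT INTO TheAnonymousTable (rowid, fingerprint, contents, created_date) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "aaa", "x", "2009-03-02 10:00:00"),
        (2, "aaa", "x", "2009-03-01 09:00:00"),
        (3, "bbb", "y", "2009-03-01 12:00:00"),
        (4, "aaa", "x", "2009-03-03 08:00:00"),
    ],
)
# Innermost: earliest date per fingerprint (M).
# Middle: rowids of those earliest rows.
# Outer: flag every row whose rowid is not in that set.
conn.execute(
    """UPDATE TheAnonymousTable
          SET duplicate = 1
        WHERE rowid NOT IN (
              SELECT T.rowid
                FROM (SELECT MIN(created_date) AS created_date, fingerprint
                        FROM TheAnonymousTable
                       GROUP BY fingerprint) AS M
                JOIN TheAnonymousTable AS T
                  ON M.created_date = T.created_date
                 AND M.fingerprint = T.fingerprint)"""
)
flags = dict(conn.execute("SELECT rowid, duplicate FROM TheAnonymousTable"))
print(flags)
```

With this data, rows 1 and 4 end up flagged and rows 2 and 3 (the earliest of each fingerprint) do not.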
Upvotes: 0
Reputation: 1232
MySQL needs to be explicitly told if the data you are grouping by is larger than 1024 bytes (see this link for details). So if your data in the fingerprint column is larger than 1024 bytes, you should set the max_sort_length variable (see this link for details about values allowed, and this link about how to set it) to a larger number so that the GROUP BY won't silently use only part of your data for grouping.
Once you're certain that MySQL will group your data properly, the following query will set the duplicate flag so that the first fingerprint record has duplicate set to FALSE/0 and any subsequent fingerprint records have duplicate set to TRUE/1:
UPDATE mytable m1
INNER JOIN (SELECT fingerprint
                 , MIN(rowid) AS minrow
              FROM mytable m2
             GROUP BY fingerprint) m3
        ON m1.fingerprint = m3.fingerprint
   SET m1.duplicate = m3.minrow != m1.rowid;
Please keep in mind that this solution does not take NULLs into account and if it is possible for the fingerprint field to be NULL then you would need additional logic to handle that case.
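The NULL caveat can be seen by running just the join from that statement as a SELECT. The sketch below does so in an in-memory SQLite database from Python. SQLite is an assumed stand-in for MySQL, and the rows and the would_be_duplicate column name are hypothetical. Rows with a NULL fingerprint never satisfy the equality in the ON clause, so they drop out of the join and would not be touched by the UPDATE.

```python
import sqlite3

# SQLite stands in for MySQL; rows 3 and 4 have a NULL fingerprint.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (rowid INTEGER PRIMARY KEY, fingerprint TEXT, duplicate BOOLEAN)"
)
conn.executemany(
    "INSERT INTO mytable (rowid, fingerprint) VALUES (?, ?)",
    [(1, "aaa"), (2, "aaa"), (3, None), (4, None), (5, "bbb")],
)
# Same join as the UPDATE, expressed as a SELECT of the value it would assign.
joined_rows = conn.execute(
    """SELECT m1.rowid, m3.minrow != m1.rowid AS would_be_duplicate
         FROM mytable m1
        INNER JOIN (SELECT fingerprint, MIN(rowid) AS minrow
                      FROM mytable m2
                     GROUP BY fingerprint) m3
           ON m1.fingerprint = m3.fingerprint
        ORDER BY m1.rowid"""
).fetchall()
print(joined_rows)
```

Only rows 1, 2 and 5 come back; rows 3 and 4 are missing from the result, which is the case the extra logic would have to cover.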
Upvotes: 2
Reputation: 3413
I don't know the MySQL syntax, but in T-SQL (SQL Server) you just do:
UPDATE t1
SET duplicate = 1
FROM MyTable t1
WHERE t1.rowid != (
    SELECT TOP 1 t2.rowid FROM MyTable t2
    WHERE t2.fingerprint = t1.fingerprint ORDER BY t2.created_date ASC
)
That may have some syntax errors, as I'm just typing off the cuff/not able to test it, but that's the gist of it.
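MySQL version (not tested). MySQL has no UPDATE ... FROM form and will not let a subquery select directly from the table being updated, so this joins against a derived table holding the earliest date per fingerprint:
UPDATE MyTable t1
JOIN (SELECT fingerprint, MIN(created_date) AS first_date
        FROM MyTable
       GROUP BY fingerprint) t2
  ON t2.fingerprint = t1.fingerprint
SET t1.duplicate = 1
WHERE t1.created_date > t2.first_date;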
Upvotes: 0
Reputation: 562270
Here's another way to do it, using MySQL's multi-table UPDATE syntax:
UPDATE mytable m1
JOIN mytable m2 ON (m1.fingerprint = m2.fingerprint AND m1.created_date < m2.created_date)
SET m2.duplicate = 1;
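The idea behind the self-join is that a row is a duplicate if any other row with the same fingerprint has an earlier created_date. Here is a minimal sketch of that rule, run from Python against an in-memory SQLite database as an assumed stand-in for MySQL. SQLite does not accept MySQL's UPDATE ... JOIN form, so the same condition is written with EXISTS; the sample rows are hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL; rows 3 and 4 share a fingerprint AND a timestamp.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE mytable (
           rowid INTEGER PRIMARY KEY,
           fingerprint TEXT,
           duplicate BOOLEAN DEFAULT 0,
           created_date DATETIME)"""
)
conn.executemany(
    "INSERT INTO mytable (rowid, fingerprint, created_date) VALUES (?, ?, ?)",
    [
        (1, "aaa", "2009-03-02 10:00:00"),
        (2, "aaa", "2009-03-01 09:00:00"),
        (3, "bbb", "2009-03-01 12:00:00"),
        (4, "bbb", "2009-03-01 12:00:00"),
    ],
)
# Flag a row when an earlier row with the same fingerprint exists.
conn.execute(
    """UPDATE mytable
          SET duplicate = 1
        WHERE EXISTS (SELECT 1
                        FROM mytable AS earlier
                       WHERE earlier.fingerprint = mytable.fingerprint
                         AND earlier.created_date < mytable.created_date)"""
)
self_join_flags = dict(conn.execute("SELECT rowid, duplicate FROM mytable"))
print(self_join_flags)
```

Only row 1 is flagged. Rows 3 and 4 tie on created_date, so neither has a strictly earlier partner and both stay unflagged; a tie-breaker on rowid would be needed if exact timestamp collisions can occur.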
Upvotes: 0
Reputation: 562270
Here's a funny way to do it:
SET @rowid := 0;
UPDATE mytable
SET duplicate = (rowid = @rowid),
rowid = (@rowid:=rowid)
ORDER BY rowid, created_date;
Use MySQL's UPDATE...ORDER BY feature to ensure that the rows are updated in order by rowid, then by created_date.
If rowid is not equal to the user variable @rowid, set duplicate to 0 (false). This will be true only on the first row encountered with a given value for rowid.
Then set rowid to its own value, setting @rowid to that value as a side effect.
When you UPDATE the next row, if it's a duplicate of the previous row, rowid will be equal to the user variable @rowid, and therefore duplicate will be set to 1 (true).
Edit: Now I have tested this, and I corrected a mistake in the line that sets duplicate.
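The ordered scan described above can be sketched outside the database. The Python below is only an illustration of the "remember the previous key" logic, not the MySQL statement itself. It assumes the grouping key is the question's fingerprint column, and the rows and the previous_key name are hypothetical.

```python
from operator import itemgetter

# Hypothetical rows; 'fingerprint' is assumed to be the grouping key.
scan_rows = [
    {"rowid": 1, "fingerprint": "aaa", "created_date": "2009-03-02"},
    {"rowid": 2, "fingerprint": "aaa", "created_date": "2009-03-01"},
    {"rowid": 3, "fingerprint": "bbb", "created_date": "2009-03-01"},
]

previous_key = None  # plays the role of the user variable
# Visit rows ordered by key, then by date, like UPDATE ... ORDER BY.
for row in sorted(scan_rows, key=itemgetter("fingerprint", "created_date")):
    # A row is a duplicate when its key matches the row visited just before it.
    row["duplicate"] = row["fingerprint"] == previous_key
    previous_key = row["fingerprint"]

marked = {row["rowid"]: row["duplicate"] for row in scan_rows}
print(marked)
```

Because the rows are visited in date order within each key, the first row of each group sees a different previous key and stays unflagged, while every later row in the group is flagged.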
Upvotes: 0
Reputation: 48290
How about a two-step approach, assuming you can go offline during a data load:
Not elegant, but gets the job done.
Upvotes: 0