Reputation: 5890
I am looking to find some records that are very similar (for all intents and purposes, duplicate records) after adding a new field to a table. Here's some sample data.
+-----+------+-----+------+-------------+----------+
| Id | Task | Sig | Form | Description | Location |
+-----+------+-----+------+-------------+----------+
| 255 | 5000 | 1 | 1 | Record 1 | (null) |
| 256 | 5000 | 1 | 1 | Record 1 | 000 |
| 257 | 5001 | 1 | 1 | Record 2 | 0T3 |
| 258 | 5001 | 1 | 2 | Record 3 | 0T3 |
| 259 | 5002 | 1 | 1 | Record 4 | 001 |
| 260 | 5003 | 1 | 1 | Record 5 | 001 |
+-----+------+-----+------+-------------+----------+
How could I design the query to just find 'duplicate' records whose only difference is the Location field?
If I use a query like this:
SELECT *
FROM MY_SAMPLE_TABLE
WHERE Task IN
(SELECT Task FROM MY_SAMPLE_TABLE
GROUP BY Task, Sig, Form, Description HAVING COUNT(*) > 1);
It returns any records with the same Task, unfortunately. And this is a table with tens of thousands of records.
Upvotes: 0
Views: 55
Reputation: 48357
Presumably each record has a unique id.
Since you should have the id of the new record, just do a join:
SELECT IF(new.id=all.id, "New", id) AS recordnum
, all.task
, all.sig
, all.form
, all.description
, all.location
FROM my_sample_table new
INNER JOIN my_sample_table all
ON new.task=all.task
AND new.sig=all.sig
AND new.form=all.form
AND new.description=all.description
-- AND new.id<>all.id -- optional to exclude the new record from the output
WHERE new.id=$THE_INSERTED_ID
If you don't want to do this on insert, but retrospectively,
SELECT task, sig,form, description, COUNT(*), GROUP_CONCAT(id), GROUP_CONCAT(location)
FROM my_sample_table
GROUP BY task, sig,form, description
HAVING COUNT(*)>1
and just use this as a subquery to get the row level records....
SELECT r.*, all.count_dups, all.ids, all.locns
FROM my_sample_table r
INNER JOIN (
SELECT task, sig,form, description, COUNT(*) as count_ups,
GROUP_CONCAT(id) AS ids, GROUP_CONCAT(location) AS locns
FROM my_sample_table
GROUP BY task, sig,form, description
HAVING COUNT(*)>1
) all
ON r.task=all.task
AND r.sig=all.sig
AND r.description=all.description
Upvotes: 0
Reputation: 1269953
One simple method is to use window functions:
select t.*
from (select t.*, count(*) over (partition by task, sig, form, description) as cnt
from my_sample_table
) t
where cnt > 1;
If you actually want the locations to be different, you can use count(distinct)
:
select t.*
from (select t.*,
count(distinct location) over (partition by task, sig, form, description) as cnt
from my_sample_table
) t
where cnt > 1;
If you want to treat NULL
as a "different" value, then the logic is a little more complex.
Upvotes: 1