Killnine
Killnine

Reputation: 5890

Select similar records in SQL

I am looking to find some records that are very similar (for all intents and purposes, duplicate records) after adding a new field to a table. Here's some sample data.

+-----+------+-----+------+-------------+----------+
| Id  | Task | Sig | Form | Description | Location |
+-----+------+-----+------+-------------+----------+
| 255 | 5000 |   1 |    1 | Record 1    | (null)   |
| 256 | 5000 |   1 |    1 | Record 1    | 000      |
| 257 | 5001 |   1 |    1 | Record 2    | 0T3      |
| 258 | 5001 |   1 |    2 | Record 3    | 0T3      |
| 259 | 5002 |   1 |    1 | Record 4    | 001      |
| 260 | 5003 |   1 |    1 | Record 5    | 001      |
+-----+------+-----+------+-------------+----------+

How could I design the query to just find 'duplicate' records whose only difference is the Location field?

If I use a query like this:

SELECT *
FROM MY_SAMPLE_TABLE
WHERE Task IN
  (SELECT Task FROM MY_SAMPLE_TABLE
  GROUP BY Task, Sig, Form, Description HAVING COUNT(*) > 1);

It returns any records with the same Task, unfortunately. And this is a table with tens of thousands of records.

Upvotes: 0

Views: 55

Answers (2)

symcbean
symcbean

Reputation: 48357

Presumably each record has a unique id.

Since you should have the id of the new record, just do a join:

SELECT IF(new.id=all.id, "New", id) AS recordnum
, all.task
, all.sig
, all.form
, all.description
, all.location
FROM my_sample_table new
INNER JOIN my_sample_table all
ON new.task=all.task
AND new.sig=all.sig
AND new.form=all.form
AND new.description=all.description
-- AND new.id<>all.id -- optional to exclude the new record from the output
WHERE new.id=$THE_INSERTED_ID

If you don't want to do this on insert, but retrospectively,

SELECT task, sig,form, description, COUNT(*), GROUP_CONCAT(id), GROUP_CONCAT(location)
FROM my_sample_table
GROUP BY task, sig,form, description
HAVING COUNT(*)>1

and just use this as a subquery to get the row level records....

SELECT r.*, all.count_dups, all.ids, all.locns
FROM my_sample_table r
INNER JOIN (
   SELECT task, sig,form, description, COUNT(*) as count_ups,
      GROUP_CONCAT(id) AS ids, GROUP_CONCAT(location) AS locns
   FROM my_sample_table
   GROUP BY task, sig,form, description
   HAVING COUNT(*)>1
) all
ON r.task=all.task
AND r.sig=all.sig
AND r.description=all.description

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269953

One simple method is to use window functions:

select t.*
from (select t.*, count(*) over (partition by task, sig, form, description) as cnt
      from my_sample_table
     ) t
where cnt > 1;

If you actually want the locations to be different, you can use count(distinct):

select t.*
from (select t.*,
             count(distinct location) over (partition by task, sig, form, description) as cnt
      from my_sample_table
     ) t
where cnt > 1;

If you want to treat NULL as a "different" value, then the logic is a little more complex.

Upvotes: 1

Related Questions