Identify duplicates based on multiple columns and parent row

Question

This is an example of table data that I am working on (the table contained a lot of columns, I am showing here only the relevant ones):

Id	job_number	status	parent_id
1	42FWD-42	0	0
2	42FWD-42	1	1
3	42FWD-42	5	1

Id is auto generated. parent_id links the job using the id.

When a new job is created via the app, a new row is created (with status "0"). The auto-generated Id is then used for subsequent rows of same job, and set as parent id.

Another record with status "1" (which is code for started) is also created just after parent record.

Explanation of the problem: due to a bug in the app, there are duplicate set of rows for the same job.

Example of problem

Id	job_number	status	parent_id
1	42FWD-42	0	0
2	42FWD-42	0	0
3	42FWD-42	1	1
4	42FWD-42	1	2
5	42FWD-42	5	1

As you can see from this example, due to the bug, there are 2 rows with "0" status for the same job, and 2 rows with "1" status.

This creates a lot of problems in operation in app where the job is updated using the job number.

The status number should not repeat for a specific job.

What I want to do is to find all duplicates like those in example. For example, I want a query where I can find all duplicates which have same job number, but different parent_id and NO "5" status.

Example result using the example table above, I need the query to return:

Id	job_number	status	parent_id
2	42FWD-42	0	0
4	42FWD-42	1	2

Explanation of this result:

Row with Id=1 is considered the correct record because it has an associated record with status "5"
Row with Id=2 is considered duplicate and its associated records are also considered duplicate

Another possible case: there are duplicate rows, but none have status=5. These rows can be discarded, ie need not be shown in results.

A brief explanation of how the query works will be appreciated.

EDIT:

I forgo to add an important information: job_number is case sensitive.

ie: 42FWD-42 and 42fwd-42 are different and valid job number. They should not be considered duplicates, and are 2 separate jobs.

The reason for this is the actual job number is not small text as in my example. It is a long string like a guid.

GuidoG · Accepted Answer

First I must mention you should block identical rows by means of a unique constraint. I suggest that once you have eliminated all duplicates you put up a such a constraint to keep this from happening again.

Now for your question, you can do this by grouping on the duplicate columns, and have only those that count more than one. Here is an example

declare @t table (id int, job_number varchar(10), status int, parent_id int)
insert into @t
values (1, '42FWD-42', 0, 0), (2, '42FWD-42', 0, 0), (3, '42FWD-42', 1, 1), (4, '42FWD-42', 1, 2), (5, '42FWD-42', 5, 1)

select max(t.id) as id, t.job_number, t.status 
from   @t t
group by t.job_number, t.status
having count(*) > 1

the result is

id  job_number  status
2   42FWD-42    0
4   42FWD-42    1

and to get also the parent_id you can add a self join

select max(t.id) as id, 
       t.job_number, 
       t.status,
       (select t2.parent_id from @t t2 where t2.id = max(t.id)) as parent_id  
from   @t t
group by t.job_number, t.status
having count(*) > 1

this returns

id  job_number  status parent_id
2   42FWD-42    0      0
4   42FWD-42    1      2

EDIT

To solve the addional problem in the edit of your question, about the case sensitive, you can fix that by using a COLLATE in your field retrieval and your comparision

this should do it

declare @t table (id int, job_number varchar(10), status int, parent_id int)
insert into @t
values (1, '42FWD-42', 0, 0), 
       (2, '42FWD-42', 0, 0), 
       (3, '42FWD-42', 1, 1), 
       (4, '42fwd-42', 1, 2), -- LOWERCASE !!!
       (5, '42FWD-42', 5, 1)

select max(t.id) as id, 
       t.job_number COLLATE Latin1_General_CS_AS, 
       t.status,
       (select t2.parent_id from @t t2 where t2.id = max(t.id)) as parent_id  
from   @t t
group by t.job_number COLLATE Latin1_General_CS_AS, t.status
having count(*) > 1

and now the result will be

id  job_number  status parent_id
2   42FWD-42    0      0

Yet another edit

Now, suppose you need to use the result of these duplicate id's in another query, you could do something like this

select t.* 
from  @t t
where t.id in ( select max(t.id) as id 
                from   @t t
                group by t.job_number COLLATE Latin1_General_CS_AS, t.status
                having count(*) > 1
              )

What I am doing here is getting only the duplicate id's in a form that can be used to feed a where clause in another query.
This way you can use the result set in any way you wish.
Also note that for this we don't need the self join to retrieve the parent_id anymore.

One possible use of this could be to delete duplicate rows, you can write

delete from yourtable
where id in ( select max(t.id) as id 
                from   @t t
                group by t.job_number COLLATE Latin1_General_CS_AS, t.status
                having count(*) > 1
            )

Identify duplicates based on multiple columns and parent row

Answers (2)

Related Questions