dx_over_dt
dx_over_dt

Reputation: 14338

Group by custom function in T-SQL

I have a table of people in which there may be duplicates. My goal is to return a list of possible duplicates so that we can combine them into a new person.

I want to group by first_name and last_name, obviously. However, if both person records have a defined birth_date and those dates differ, then I want to exclude the records, since odds are the people are different but happen to have the same name.

The other caveat is that in our system (which I inherited), the birth_date column is NOT NULL, and non-specified birth_dates are set to '1900-01-01'.

Is there a way I can GROUP BY a custom function (or use some other clever logic) that either compares just the birth_date columns checking to see if both are not the default date or they are if the same, or else takes in arguments, like say each person_id and compares the records against each other, returning a BIT to decide whether they should count as the same group?

I'd like to avoid CLR-defined aggregate functions (since I'm inexperienced with it).

So far (without the birth_date comparison) the query I have is:

SELECT * 
FROM core_person P
WHERE last_name + ',' + first_name IN 
    (SELECT last_name + ',' + first_name "name" 
     FROM core_person 
     GROUP BY last_name + ',' + first_name
     HAVING COUNT(*) > 1)
ORDER BY last_name + ',' + first_name

I would like to add something to the GROUP BY clause to compare the birth dates.

Upvotes: 2

Views: 244

Answers (2)

Brian Pressler
Brian Pressler

Reputation: 6713

You can use the nullif function to return nulls if the date is equal to 1/1/1900 as nullif(Birthday,'1/1/1900') to your advantage.

This query can get you started to see all the records with their possible matches:

select p1.person_id
from core_person p1
join core_person p2
on p1.person_id <> p2.person_id
    and LEFT(p1.first_name,5) = LEFT(p2.first_name,5)
    and LEFT(p1.last_name,5) = LEFT(p2.last_name,5)
    and isnull(nullif(p1.Birthday,'1/1/1900'), p2.Birthday) = isnull(nullif(p2.Birthday,'1/1/1900'), p1.Birthday)
group by p1.person_id

If either one of the Birthday's are equal to 1/1/1900 it will compare the birthday to itself, otherwise it will only join on equality of the birthday's in both records.

If you don't want to see your matches, you can use a variation of the query above as a sub-query to return only id values that are duplicates:

select core_person
from core_person 
where person_id in
    (
    select p1.person_id
    from core_person p1
    join core_person p2
    on p1.person_id <> p2.person_id
        and LEFT(p1.first_name,5) = LEFT(p2.first_name,5)
        and LEFT(p1.last_name,5) = LEFT(p2.last_name,5)
        and isnull(nullif(p1.Birthday,'1/1/1900'), p2.Birthday) = isnull(nullif(p2.Birthday,'1/1/1900'), p1.Birthday)
    group by p1.person_id
    )

Upvotes: 1

APH
APH

Reputation: 4154

Instead of grouping, will something like this work for you?

select * from MyTable a
left join MyTable b
on a.person_id < b.person_id 
    and a.first_name = b.first_name 
    and a.last_name = b.last_name
    and (
        a.birthdate = b.birthdate 
        or a.birthdate = '1900-1-1' 
        or b.birthdate = '1900-1-1'
        )

It matches rows where last name and first name match, and either birthdates match or one birthdate is your placeholder value. The person_ID part of the join gets rid of duplicates (e.g. 1 matches to 2, then another row where 2 matches to 1, or 1 matches to 1).

You may want to broaden the match criteria for the names to look at first few characters or use SOUNDEX, but then your matches would probably require more hand-sorting as a final step.

Edit: to return a list of all records that have a possible duplicate in the table, not associated with their matches, use this instead:

select distinct a.* from MyTable a
inner join MyTable b
on a.person_id <> b.person_id 
    and a.first_name = b.first_name 
    and a.last_name = b.last_name
    and (
        a.birthdate = b.birthdate 
        or a.birthdate = '1900-1-1' 
        or b.birthdate = '1900-1-1'
        )
order by a.first_name, a.last_name, a.birthdate

Upvotes: 0

Related Questions