Group by custom function in T-SQL

Question

I have a table of people in which there may be duplicates. My goal is to return a list of possible duplicates so that we can combine them into a new person.

I want to group by first_name and last_name, obviously. However, if both person records have a defined birth_date and those dates differ, then I want to exclude the records, since odds are the people are different but happen to have the same name.

The other caveat is that in our system (which I inherited), the birth_date column is NOT NULL, and non-specified birth_dates are set to '1900-01-01'.

Is there a way I can GROUP BY a custom function (or use some other clever logic) that either compares just the birth_date columns checking to see if both are not the default date or they are if the same, or else takes in arguments, like say each person_id and compares the records against each other, returning a BIT to decide whether they should count as the same group?

I'd like to avoid CLR-defined aggregate functions (since I'm inexperienced with it).

So far (without the birth_date comparison) the query I have is:

SELECT * 
FROM core_person P
WHERE last_name + ',' + first_name IN 
    (SELECT last_name + ',' + first_name "name" 
     FROM core_person 
     GROUP BY last_name + ',' + first_name
     HAVING COUNT(*) > 1)
ORDER BY last_name + ',' + first_name

I would like to add something to the GROUP BY clause to compare the birth dates.

Brian Pressler · Accepted Answer

You can use the nullif function to return nulls if the date is equal to 1/1/1900 as nullif(Birthday,'1/1/1900') to your advantage.

This query can get you started to see all the records with their possible matches:

select p1.person_id
from core_person p1
join core_person p2
on p1.person_id <> p2.person_id
    and LEFT(p1.first_name,5) = LEFT(p2.first_name,5)
    and LEFT(p1.last_name,5) = LEFT(p2.last_name,5)
    and isnull(nullif(p1.Birthday,'1/1/1900'), p2.Birthday) = isnull(nullif(p2.Birthday,'1/1/1900'), p1.Birthday)
group by p1.person_id

If either one of the Birthday's are equal to 1/1/1900 it will compare the birthday to itself, otherwise it will only join on equality of the birthday's in both records.

If you don't want to see your matches, you can use a variation of the query above as a sub-query to return only id values that are duplicates:

select core_person
from core_person 
where person_id in
    (
    select p1.person_id
    from core_person p1
    join core_person p2
    on p1.person_id <> p2.person_id
        and LEFT(p1.first_name,5) = LEFT(p2.first_name,5)
        and LEFT(p1.last_name,5) = LEFT(p2.last_name,5)
        and isnull(nullif(p1.Birthday,'1/1/1900'), p2.Birthday) = isnull(nullif(p2.Birthday,'1/1/1900'), p1.Birthday)
    group by p1.person_id
    )

Group by custom function in T-SQL

Answers (2)

Related Questions