Reputation: 14338
I have a table of people in which there may be duplicates. My goal is to return a list of possible duplicates so that we can combine them into a new person.
I want to group by first_name
and last_name
, obviously. However, if both person records have a defined birth_date
and those dates differ, then I want to exclude the records, since odds are the people are different but happen to have the same name.
The other caveat is that in our system (which I inherited), the birth_date
column is NOT NULL
, and non-specified birth_date
s are set to '1900-01-01'
.
Is there a way I can GROUP BY
a custom function (or use some other clever logic) that either compares just the birth_date
columns checking to see if both are not the default date or they are if the same, or else takes in arguments, like say each person_id
and compares the records against each other, returning a BIT
to decide whether they should count as the same group?
I'd like to avoid CLR-defined aggregate functions (since I'm inexperienced with it).
So far (without the birth_date
comparison) the query I have is:
SELECT *
FROM core_person P
WHERE last_name + ',' + first_name IN
(SELECT last_name + ',' + first_name "name"
FROM core_person
GROUP BY last_name + ',' + first_name
HAVING COUNT(*) > 1)
ORDER BY last_name + ',' + first_name
I would like to add something to the GROUP BY
clause to compare the birth dates.
Upvotes: 2
Views: 244
Reputation: 6713
You can use the nullif function to return nulls if the date is equal to 1/1/1900 as nullif(Birthday,'1/1/1900')
to your advantage.
This query can get you started to see all the records with their possible matches:
select p1.person_id
from core_person p1
join core_person p2
on p1.person_id <> p2.person_id
and LEFT(p1.first_name,5) = LEFT(p2.first_name,5)
and LEFT(p1.last_name,5) = LEFT(p2.last_name,5)
and isnull(nullif(p1.Birthday,'1/1/1900'), p2.Birthday) = isnull(nullif(p2.Birthday,'1/1/1900'), p1.Birthday)
group by p1.person_id
If either one of the Birthday's are equal to 1/1/1900 it will compare the birthday to itself, otherwise it will only join on equality of the birthday's in both records.
If you don't want to see your matches, you can use a variation of the query above as a sub-query to return only id values that are duplicates:
select core_person
from core_person
where person_id in
(
select p1.person_id
from core_person p1
join core_person p2
on p1.person_id <> p2.person_id
and LEFT(p1.first_name,5) = LEFT(p2.first_name,5)
and LEFT(p1.last_name,5) = LEFT(p2.last_name,5)
and isnull(nullif(p1.Birthday,'1/1/1900'), p2.Birthday) = isnull(nullif(p2.Birthday,'1/1/1900'), p1.Birthday)
group by p1.person_id
)
Upvotes: 1
Reputation: 4154
Instead of grouping, will something like this work for you?
select * from MyTable a
left join MyTable b
on a.person_id < b.person_id
and a.first_name = b.first_name
and a.last_name = b.last_name
and (
a.birthdate = b.birthdate
or a.birthdate = '1900-1-1'
or b.birthdate = '1900-1-1'
)
It matches rows where last name and first name match, and either birthdates match or one birthdate is your placeholder value. The person_ID part of the join gets rid of duplicates (e.g. 1 matches to 2, then another row where 2 matches to 1, or 1 matches to 1).
You may want to broaden the match criteria for the names to look at first few characters or use SOUNDEX
, but then your matches would probably require more hand-sorting as a final step.
Edit: to return a list of all records that have a possible duplicate in the table, not associated with their matches, use this instead:
select distinct a.* from MyTable a
inner join MyTable b
on a.person_id <> b.person_id
and a.first_name = b.first_name
and a.last_name = b.last_name
and (
a.birthdate = b.birthdate
or a.birthdate = '1900-1-1'
or b.birthdate = '1900-1-1'
)
order by a.first_name, a.last_name, a.birthdate
Upvotes: 0