Reputation: 2119
I have a number of enterprise datasets that I must find missing links between, and one of the ways I use for finding potential matches is joining on first and last name. The complication is that we have a significant number of people who use their legal name in one dataset (employee records), but they use either a nickname or (worse yet) their middle name in others (i.e., EAD, training, PIV card, etc.). I am looking for a way to match up these potentially disparate names across the various datasets.
Here is an overly simplified example of what I am trying to do, but I think it conveys my thought process. I begin with the employee table:
employee_id | first_name | last_name |
---|---|---|
052451 | Robert | Armsden |
442896 | Jacob | Craxford |
054149 | Grant | Keeting |
025747 | Gabrielle | Renton |
071238 | Margaret | Seifenmacher |
and try to find the matching data from the PIV card dataset:
card_id | first_name | last_name |
---|---|---|
1008571527 | Bobbie | Armsden |
1009599982 | Jake | Craxford |
1004786477 | Gabi | Renton |
1000628540 | Maggy | Seifenmacher |
After trying to match these datasets on first name and last name, I would like to end up with the following:
emp_employee_id | emp_first_name | emp_last_name | crd_card_id | crd_first_name | crd_last_name |
---|---|---|---|---|---|
052451 | Robert | Armsden | 1008571527 | Bobbie | Armsden |
442896 | Jacob | Craxford | 1009599982 | Jake | Craxford |
054149 | Grant | Keeting | NULL | NULL | NULL |
025747 | Gabrielle | Renton | 1004786477 | Gabi | Renton |
071238 | Margaret | Seifenmacher | 1000628540 | Maggy | Seifenmacher |
As you can see, I would like to make the following matches:
Gabrielle -> Gabi
Jacob -> Jacob
Margaret -> Maggy
Robert -> Bobbie
My initial thought was to find a common names dataset along the lines of:
name1 | name2 | name3 | name4 |
---|---|---|---|
Gabrielle | Gabi | NULL | NULL |
Jacob | Jake | NULL | NULL |
Margaret | Maggy | Maggie | Meg |
Michael | Mike | Mikey | Mick |
Robert | Bobbie | Bob | Rob |
and use something like this for the JOIN:
CREATE TABLE employee_cards AS
SELECT
employees.employee_id AS emp_employee_ID,
employees.first_name AS emp_first_name,
employees.last_name AS emp_last_name,
cards.card_id AS crd_card_id,
cards.first_name AS crd_first_name,
cards.last_name AS crd_last_name
FROM employees
LEFT OUTER JOIN name_aliases
LEFT OUTER JOIN cards
ON employees.first_name IN (
nane_aliases.name1,
nane_aliases.name2,
nane_aliases.name3,
nane_aliases.name4
)
AND employees.last_name = cards.last_name;
This is where I got stuck, since I could not figure out how to tie the result of the first ON condition to the first names in the cards table.
As I ponder more deeply into this problem, I know that I am not the first person who has encountered this need for matching on common name variants. My initial search pointed me to things like fuzzysearch
and soundex
, but those are not quite what I need for the present scenario (although they may come in handy down the road). Given this, I have several questions for the community:
Has someone compiled or crowd-sourced a comprehensive names variants dataset along the lines of my name-aliases
table above? My search landed me on a couple of sites that appear to have this kind of data, but neither one of them is downloadable for import into my local database.
I did find this SO discussion over a decade old, but it did not appear to be current: Database of common name aliases / nicknames of people
Also, I am unable to pay any money for this, so I hope there might be one hiding on github.
Since each record in the name_aliases table can have two or more entries, is there a better way to set up the structure for this table, to make it infinitely flexible?
How do I set up the JOIN query to match up employees.first_name with any column in name_alises, and then finally match that with cards.first_name?
Am I taking the wrong approach to solve this problem? Has someone come up with a more flexible and elegant approach to this using PostgreSQL?
Upvotes: 2
Views: 1113
Reputation: 29
Don't use the two-column approach, use a given name and related alias table. This way you can use a soundex on the given name and the exact match on the alias table.
like this:
CREATE TABLE [Names].[Alias](
[AliasId] [int] NOT NULL,
[GivenNameId] [int] NOT NULL,
[AliasName] [nvarchar](50) NOT NULL,
[CreatedBy] [uniqueidentifier] NOT NULL,
[CreatedDate] [datetime] NOT NULL
) ON [PRIMARY]
GO
/****** Object: Table [Names].[GivenName] Script Date: 2/2/2023 3:56:38 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [Names].[GivenName](
[GivenNameId] [int] IDENTITY(1,1) NOT NULL,
[GivenName] [nvarchar](50) NOT NULL,
[CreatedBy] [uniqueidentifier] NOT NULL,
[CreatedDate] [datetime] NOT NULL,
CONSTRAINT [PK_GivenName] PRIMARY KEY CLUSTERED
(
[GivenNameId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [Names].[Alias] ADD CONSTRAINT [DF_Alias_CreatedDate] DEFAULT (getutcdate()) FOR [CreatedDate]
GO
ALTER TABLE [Names].[GivenName] ADD CONSTRAINT [DF_GivenName_CreatedDate] DEFAULT (getutcdate()) FOR [CreatedDate]
GO
ALTER TABLE [Names].[Alias] WITH CHECK ADD CONSTRAINT [FK_Alias_GivenName] FOREIGN KEY([GivenNameId])
REFERENCES [Names].[GivenName] ([GivenNameId])
GO
ALTER TABLE [Names].[Alias] CHECK CONSTRAINT [FK_Alias_GivenName]
GO
Upvotes: 0
Reputation: 2119
It was a bit of a battle, but I was able to get all of my questions answered and I was thrilled to finally get everything working on this project. Details are below.
I found a couple of promising nicknames datasets on Github, and this one looked like it is the most actively maintained: https://github.com/carltonnorthern/nickname-and-diminutive-names-lookup. I downloaded the names.csv file to my computer and imported it into my database with the following code:
DROP TABLE IF EXISTS names_aliases_temp;
CREATE TABLE names_aliases_temp
(
names_data text
);
COPY names_aliases_temp
FROM '~/Downloads/names.csv';
DROP TABLE IF EXISTS names_aliases;
CREATE TABLE names_aliases
(
id serial,
nicknames text[]
);
INSERT INTO names_aliases (nicknames)
SELECT string_to_array(names_data,',') FROM names_aliases_temp;
DROP TABLE IF EXISTS names_aliases_temp;
CREATE INDEX idx_gin_names ON names_aliases USING GIN(nicknames);
Here is a sample of what this looks like after importing into the database:
id | nicknames |
---|---|
1 | {aaron,erin,ronnie,ron} |
2 | {abbigail,nabby,abby,gail} |
3 | {abednego,bedney} |
4 | {abel,ebbie,ab,abe,eb} |
5 | {abiel,ab} |
6 | {abigail,nabby,abby,gail} |
7 | {abijah,ab,bige} |
8 | {abner,ab} |
9 | {abraham,ab,abe} |
10 | {abram,ab, abe} |
I was thrilled at how easy it was to get into array format, but I was even more thrilled at how I was able to use the array format in the JOIN query! Since the number of entries per row vary so widely, I found that the array
data type made it a natural fit for this data, and this also made it trivially easy to match records using the ILIKE ANY()
operator.
DROP TABLE IF EXISTS employee_cards;
CREATE TABLE employee_cards AS
WITH joined_data AS (
SELECT
employees.employee_id AS emp_id,
employees.first_name AS emp_first_name,
employees.last_name AS emp_last_name,
cards.first_name AS crd_first_name,
cards.last_name AS crd_last_name,
cards.card_id AS crd_id
FROM employees
-- Attempt to match first names with nicknames
LEFT JOIN names_aliases
ON employees.first_name ILIKE ANY(names_aliases.nicknames)
LEFT JOIN cards
-- First match records where name is the same between `employees` and `cards`
ON (employees.last_name ILIKE cards.last_name
AND employees.first_name ILIKE cards.first_name)
-- Then bring in nicknames where no matches are found
OR (employees.last_name ILIKE cards.last_name
AND cards.first_name ILIKE ANY(names_aliases.nicknames))
)
-- Put successful matches at the top for each employee and retain only the first row
SELECT DISTINCT ON (emp_id)
emp_id,
emp_first_name,
emp_last_name,
crd_first_name,
crd_last_name,
crd_id
FROM joined_data
ORDER BY
emp_id,
crd_id NULLS LAST;
For your convenience, the three tables used in this exercise are below.
employee_id | first_name | last_name |
---|---|---|
052451 | Robert | Armsden |
022448 | Michael | Brown |
442896 | Jacob | Craxford |
054149 | Grant | Keeting |
025747 | Gabrielle | Renton |
425972 | Consorcia | Reyas |
071238 | Margaret | Seifenmacher |
insert into public.employees (employee_id, first_name, last_name)
values ('052451', 'Robert', 'Armsden'),
('022448', 'Michael', 'Brown'),
('442896', 'Jacob', 'Craxford'),
('054149', 'Grant', 'Keeting'),
('025747', 'Gabrielle', 'Renton'),
('425972', 'Consorcia', 'Reyas'),
('071238', 'Margaret', 'Seifenmacher');
card_id | first_name | last_name |
---|---|---|
1008571527 | Bob | Armsden |
1000594085 | Michael | Brown |
1009599982 | Jake | Craxford |
1004786477 | Gabby | Renton |
1009481574 | Consorcia | Reyas |
1000628540 | Maggy | Seifenmacher |
insert into public.cards (card_id, first_name, last_name)
values ('1008571527', 'Bob', 'Armsden'),
('1000594085', 'Michael', 'Brown'),
('1009599982', 'Jake', 'Craxford'),
('1004786477', 'Gabby', 'Renton'),
('1009481574', 'Consorcia', 'Reyas'),
('1000628540', 'Maggy', 'Seifenmacher');
emp_id | emp_first_name | emp_last_name | crd_first_name | crd_last_name | crd_id |
---|---|---|---|---|---|
052451 | Robert | Armsden | Bob | Armsden | 1008571527 |
022448 | Michael | Brown | Michael | Brown | 1000594085 |
442896 | Jacob | Craxford | Jake | Craxford | 1009599982 |
054149 | Grant | Keeting | NULL | NULL | NULL |
025747 | Gabrielle | Renton | Gabby | Renton | 1004786477 |
425972 | Consorcia | Reyas | Consorcia | Reyas | 1009481574 |
071238 | Margaret | Seifenmacher | Maggy | Seifenmacher | 1000628540 |
insert into public.employee_cards (emp_id, emp_first_name, emp_last_name, crd_first_name, crd_last_name, crd_id)
values ('052451', 'Robert', 'Armsden', 'Bob', 'Armsden', '1008571527'),
('022448', 'Michael', 'Brown', 'Michael', 'Brown', '1000594085'),
('442896', 'Jacob', 'Craxford', 'Jake', 'Craxford', '1009599982'),
('054149', 'Grant', 'Keeting', null, null, null),
('025747', 'Gabrielle', 'Renton', 'Gabby', 'Renton', '1004786477'),
('425972', 'Consorcia', 'Reyas', 'Consorcia', 'Reyas', '1009481574'),
('071238', 'Margaret', 'Seifenmacher', 'Maggy', 'Seifenmacher', '1000628540');
Upvotes: 2
Reputation: 623
How to structure and query and the aliases table is an interesting question. I'd suggest organizing it in pairs rather than wider rows, because you don't know in advance how many variations may eventually be needed in a group of connected names, and a two column structure gives you the ability to add to a given group indefinitely:
name1 | name2 |
---|---|
Jacob | Jake |
Margaret | Maggy |
Margaret | Maggie |
Margaret | Meg |
Maggy | Maggie |
Maggy | Meg |
Maggie | Meg |
Then you just check both columns in each JOIN in the query, something like this:
SELECT DISTINCT
employees.employee_id AS emp_employee_id
,employees.first_name AS emp_first_name
,employees.last_name AS emp_last_name
,cards.card_id AS card_id
,cards.first_name AS crd_first_name
,cards.last_name AS crd_last_name
FROM
employees
INNER JOIN name_aliases ON
name_aliases.name1 = employees.first_name
OR name_aliases.name2 = employee.first_name
LEFT JOIN cards ON
(
cards.first_name = name_aliases.name1
OR cards.first_name = name_aliases.name2
)
AND cards.last_name = employees.last_name
This assumes that some employees aren't in the cards table, and that no two people have the same-ish name. If those assumptions are correct you'll need the DISTINCT to eliminate duplicates on the partial outer join.
You could further simplify the query logic by pairing the names in a full outer join:
name1 | name2 |
---|---|
Jacob | Jacob |
Jacob | Jake |
Jake | Jacob |
Jake | Jake |
Margaret | Margaret |
Margaret | Maggy |
Maggy | Margaret |
Maggy | Maggy |
Meg | Meg |
Margaret | Meg |
Meg | Margaret |
Maggy | Meg |
Meg | Maggy |
and so on, and then you could treat the aliases table almost like a many-to-many join table:
SELECT DISTINCT
employees.employee_id AS emp_employee_id
,employees.first_name AS emp_first_name
,employees.last_name AS emp_last_name
,cards.card_id AS card_id
,cards.first_name AS crd_first_name
,cards.last_name AS crd_last_name
FROM
employees
INNER JOIN name_aliases ON name_aliases.name1 = employees.first_name
LEFT JOIN cards ON
cards.first_name = name_aliases.name2
AND cards.last_name = employees.last_name
Upvotes: 0