Reputation: 2119

PostgreSQL: Match Common Name Variants (Nicknames)

Scenario

I have a number of enterprise datasets that I must find missing links between, and one of the ways I use for finding potential matches is joining on first and last name. The complication is that we have a significant number of people who use their legal name in one dataset (employee records), but they use either a nickname or (worse yet) their middle name in others (i.e., EAD, training, PIV card, etc.). I am looking for a way to match up these potentially disparate names across the various datasets.

Simplified Example

Here is an overly simplified example of what I am trying to do, but I think it conveys my thought process. I begin with the employee table:

Employees table

employee_id	first_name	last_name
052451	Robert	Armsden
442896	Jacob	Craxford
054149	Grant	Keeting
025747	Gabrielle	Renton
071238	Margaret	Seifenmacher

and try to find the matching data from the PIV card dataset:

Cards table

card_id	first_name	last_name
1008571527	Bobbie	Armsden
1009599982	Jake	Craxford
1004786477	Gabi	Renton
1000628540	Maggy	Seifenmacher

Desired Result

After trying to match these datasets on first name and last name, I would like to end up with the following:

Employees_Cards table

emp_employee_id	emp_first_name	emp_last_name	crd_card_id	crd_first_name	crd_last_name
052451	Robert	Armsden	1008571527	Bobbie	Armsden
442896	Jacob	Craxford	1009599982	Jake	Craxford
054149	Grant	Keeting	NULL	NULL	NULL
025747	Gabrielle	Renton	1004786477	Gabi	Renton
071238	Margaret	Seifenmacher	1000628540	Maggy	Seifenmacher

As you can see, I would like to make the following matches:

Gabrielle -> Gabi
Jacob -> Jacob
Margaret -> Maggy
Robert -> Bobbie

My initial thought was to find a common names dataset along the lines of:

Name_Aliases table

name1	name2	name3	name4
Gabrielle	Gabi	NULL	NULL
Jacob	Jake	NULL	NULL
Margaret	Maggy	Maggie	Meg
Michael	Mike	Mikey	Mick
Robert	Bobbie	Bob	Rob

and use something like this for the JOIN:

CREATE TABLE employee_cards AS
    SELECT 
        employees.employee_id AS emp_employee_ID,
        employees.first_name AS emp_first_name,
        employees.last_name AS emp_last_name,
        cards.card_id AS crd_card_id,
        cards.first_name AS crd_first_name,
        cards.last_name AS crd_last_name
    FROM employees
    LEFT OUTER JOIN name_aliases
    LEFT OUTER JOIN cards
        ON employees.first_name IN (
            nane_aliases.name1,
            nane_aliases.name2,
            nane_aliases.name3,
            nane_aliases.name4
        )
        AND employees.last_name = cards.last_name;

This is where I got stuck, since I could not figure out how to tie the result of the first ON condition to the first names in the cards table.

Some Questions

As I ponder more deeply into this problem, I know that I am not the first person who has encountered this need for matching on common name variants. My initial search pointed me to things like fuzzysearch and soundex, but those are not quite what I need for the present scenario (although they may come in handy down the road). Given this, I have several questions for the community:

Downloadable Common Names Variants Dataset?

Has someone compiled or crowd-sourced a comprehensive names variants dataset along the lines of my name-aliases table above? My search landed me on a couple of sites that appear to have this kind of data, but neither one of them is downloadable for import into my local database.

I did find this SO discussion over a decade old, but it did not appear to be current: Database of common name aliases / nicknames of people

Also, I am unable to pay any money for this, so I hope there might be one hiding on github.

Better Way to Structure the Name_Aliases table?

Since each record in the name_aliases table can have two or more entries, is there a better way to set up the structure for this table, to make it infinitely flexible?

Match Any Column in Name_Aliases Table?

How do I set up the JOIN query to match up employees.first_name with any column in name_alises, and then finally match that with cards.first_name?

Better Solution?

Am I taking the wrong approach to solve this problem? Has someone come up with a more flexible and elegant approach to this using PostgreSQL?

Upvotes: 2

Answers (3)

M. Kirk Jennings

Reputation: 29

Don't use the two-column approach, use a given name and related alias table. This way you can use a soundex on the given name and the exact match on the alias table.

like this:

CREATE TABLE [Names].[Alias](
    [AliasId] [int] NOT NULL,
    [GivenNameId] [int] NOT NULL,
    [AliasName] [nvarchar](50) NOT NULL,
    [CreatedBy] [uniqueidentifier] NOT NULL,
    [CreatedDate] [datetime] NOT NULL
) ON [PRIMARY]
GO
/****** Object:  Table [Names].[GivenName]    Script Date: 2/2/2023 3:56:38 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [Names].[GivenName](
    [GivenNameId] [int] IDENTITY(1,1) NOT NULL,
    [GivenName] [nvarchar](50) NOT NULL,
    [CreatedBy] [uniqueidentifier] NOT NULL,
    [CreatedDate] [datetime] NOT NULL,
 CONSTRAINT [PK_GivenName] PRIMARY KEY CLUSTERED 
(
    [GivenNameId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [Names].[Alias] ADD  CONSTRAINT [DF_Alias_CreatedDate]  DEFAULT (getutcdate()) FOR [CreatedDate]
GO
ALTER TABLE [Names].[GivenName] ADD  CONSTRAINT [DF_GivenName_CreatedDate]  DEFAULT (getutcdate()) FOR [CreatedDate]
GO
ALTER TABLE [Names].[Alias]  WITH CHECK ADD  CONSTRAINT [FK_Alias_GivenName] FOREIGN KEY([GivenNameId])
REFERENCES [Names].[GivenName] ([GivenNameId])
GO
ALTER TABLE [Names].[Alias] CHECK CONSTRAINT [FK_Alias_GivenName]
GO

Upvotes: 0

Michael Sheaver

Reputation: 2119

Solution

It was a bit of a battle, but I was able to get all of my questions answered and I was thrilled to finally get everything working on this project. Details are below.

Nicknames Dataset

I found a couple of promising nicknames datasets on Github, and this one looked like it is the most actively maintained: https://github.com/carltonnorthern/nickname-and-diminutive-names-lookup. I downloaded the names.csv file to my computer and imported it into my database with the following code:

Import Script

DROP TABLE IF EXISTS names_aliases_temp;
CREATE TABLE names_aliases_temp
(
   names_data text
);

COPY names_aliases_temp
FROM '~/Downloads/names.csv';

DROP TABLE IF EXISTS names_aliases;
CREATE TABLE names_aliases
(
   id serial,
   nicknames text[]
);

INSERT INTO names_aliases (nicknames)
SELECT string_to_array(names_data,',') FROM names_aliases_temp;

DROP TABLE IF EXISTS names_aliases_temp;

CREATE INDEX idx_gin_names ON names_aliases USING GIN(nicknames);

Names_Aliases Dataset

Here is a sample of what this looks like after importing into the database:

id	nicknames
1	{aaron,erin,ronnie,ron}
2	{abbigail,nabby,abby,gail}
3	{abednego,bedney}
4	{abel,ebbie,ab,abe,eb}
5	{abiel,ab}
6	{abigail,nabby,abby,gail}
7	{abijah,ab,bige}
8	{abner,ab}
9	{abraham,ab,abe}
10	{abram,ab, abe}

Note on Using Arrays

I was thrilled at how easy it was to get into array format, but I was even more thrilled at how I was able to use the array format in the JOIN query! Since the number of entries per row vary so widely, I found that the array data type made it a natural fit for this data, and this also made it trivially easy to match records using the ILIKE ANY()operator.

JOIN Query Script

DROP TABLE IF EXISTS employee_cards;

CREATE TABLE employee_cards AS
WITH joined_data AS (
    SELECT
        employees.employee_id AS emp_id,
        employees.first_name  AS emp_first_name,
        employees.last_name   AS emp_last_name,
        cards.first_name      AS crd_first_name,
        cards.last_name       AS crd_last_name,
        cards.card_id         AS crd_id
    FROM employees
        -- Attempt to match first names with nicknames
        LEFT JOIN names_aliases
            ON employees.first_name ILIKE ANY(names_aliases.nicknames)
        LEFT JOIN cards
            -- First match records where name is the same between `employees` and `cards`
            ON (employees.last_name ILIKE cards.last_name
               AND employees.first_name ILIKE cards.first_name)
            -- Then bring in nicknames where no matches are found
            OR (employees.last_name ILIKE cards.last_name
                AND cards.first_name ILIKE ANY(names_aliases.nicknames))
)
-- Put successful matches at the top for each employee and retain only the first row
SELECT DISTINCT ON (emp_id)
    emp_id,
    emp_first_name,
    emp_last_name,
    crd_first_name,
    crd_last_name,
    crd_id
FROM joined_data
ORDER BY
    emp_id, 
    crd_id NULLS LAST;

Tables Used

For your convenience, the three tables used in this exercise are below.

Employees Table

employee_id	first_name	last_name
052451	Robert	Armsden
022448	Michael	Brown
442896	Jacob	Craxford
054149	Grant	Keeting
025747	Gabrielle	Renton
425972	Consorcia	Reyas
071238	Margaret	Seifenmacher

insert into public.employees (employee_id, first_name, last_name)
values  ('052451', 'Robert', 'Armsden'),
        ('022448', 'Michael', 'Brown'),
        ('442896', 'Jacob', 'Craxford'),
        ('054149', 'Grant', 'Keeting'),
        ('025747', 'Gabrielle', 'Renton'),
        ('425972', 'Consorcia', 'Reyas'),
        ('071238', 'Margaret', 'Seifenmacher');

Cards Table

card_id	first_name	last_name
1008571527	Bob	Armsden
1000594085	Michael	Brown
1009599982	Jake	Craxford
1004786477	Gabby	Renton
1009481574	Consorcia	Reyas
1000628540	Maggy	Seifenmacher

insert into public.cards (card_id, first_name, last_name)
values  ('1008571527', 'Bob', 'Armsden'),
        ('1000594085', 'Michael', 'Brown'),
        ('1009599982', 'Jake', 'Craxford'),
        ('1004786477', 'Gabby', 'Renton'),
        ('1009481574', 'Consorcia', 'Reyas'),
        ('1000628540', 'Maggy', 'Seifenmacher');

Employee Cards (Joined) Table

emp_id	emp_first_name	emp_last_name	crd_first_name	crd_last_name	crd_id
052451	Robert	Armsden	Bob	Armsden	1008571527
022448	Michael	Brown	Michael	Brown	1000594085
442896	Jacob	Craxford	Jake	Craxford	1009599982
054149	Grant	Keeting	NULL	NULL	NULL
025747	Gabrielle	Renton	Gabby	Renton	1004786477
425972	Consorcia	Reyas	Consorcia	Reyas	1009481574
071238	Margaret	Seifenmacher	Maggy	Seifenmacher	1000628540

insert into public.employee_cards (emp_id, emp_first_name, emp_last_name, crd_first_name, crd_last_name, crd_id)
values  ('052451', 'Robert', 'Armsden', 'Bob', 'Armsden', '1008571527'),
        ('022448', 'Michael', 'Brown', 'Michael', 'Brown', '1000594085'),
        ('442896', 'Jacob', 'Craxford', 'Jake', 'Craxford', '1009599982'),
        ('054149', 'Grant', 'Keeting', null, null, null),
        ('025747', 'Gabrielle', 'Renton', 'Gabby', 'Renton', '1004786477'),
        ('425972', 'Consorcia', 'Reyas', 'Consorcia', 'Reyas', '1009481574'),
        ('071238', 'Margaret', 'Seifenmacher', 'Maggy', 'Seifenmacher', '1000628540');

Upvotes: 2

Don R

Reputation: 623

How to structure and query and the aliases table is an interesting question. I'd suggest organizing it in pairs rather than wider rows, because you don't know in advance how many variations may eventually be needed in a group of connected names, and a two column structure gives you the ability to add to a given group indefinitely:

name1	name2
Jacob	Jake
Margaret	Maggy
Margaret	Maggie
Margaret	Meg
Maggy	Maggie
Maggy	Meg
Maggie	Meg

Then you just check both columns in each JOIN in the query, something like this:

SELECT DISTINCT
    employees.employee_id AS emp_employee_id
    ,employees.first_name AS emp_first_name
    ,employees.last_name  AS emp_last_name
    ,cards.card_id        AS card_id
    ,cards.first_name     AS crd_first_name
    ,cards.last_name      AS crd_last_name
FROM
    employees
    INNER JOIN name_aliases ON
        name_aliases.name1    = employees.first_name
        OR name_aliases.name2 = employee.first_name
    LEFT JOIN cards ON
        (
            cards.first_name    = name_aliases.name1
            OR cards.first_name = name_aliases.name2
        )
        AND cards.last_name = employees.last_name

This assumes that some employees aren't in the cards table, and that no two people have the same-ish name. If those assumptions are correct you'll need the DISTINCT to eliminate duplicates on the partial outer join.

You could further simplify the query logic by pairing the names in a full outer join:

name1	name2
Jacob	Jacob
Jacob	Jake
Jake	Jacob
Jake	Jake
Margaret	Margaret
Margaret	Maggy
Maggy	Margaret
Maggy	Maggy
Meg	Meg
Margaret	Meg
Meg	Margaret
Maggy	Meg
Meg	Maggy

and so on, and then you could treat the aliases table almost like a many-to-many join table:

SELECT DISTINCT
    employees.employee_id AS emp_employee_id
    ,employees.first_name AS emp_first_name
    ,employees.last_name  AS emp_last_name
    ,cards.card_id        AS card_id
    ,cards.first_name     AS crd_first_name
    ,cards.last_name      AS crd_last_name
FROM
    employees
    INNER JOIN name_aliases ON name_aliases.name1 = employees.first_name
    LEFT JOIN cards ON
        cards.first_name    = name_aliases.name2
        AND cards.last_name = employees.last_name

Upvotes: 0

PostgreSQL: Match Common Name Variants (Nicknames)

Scenario

Simplified Example

Employees table

Cards table

Desired Result

Employees_Cards table

Name_Aliases table

Some Questions

Downloadable Common Names Variants Dataset?

Better Way to Structure the Name_Aliases table?

Match Any Column in Name_Aliases Table?

Better Solution?

Answers (3)

Solution

Nicknames Dataset

Import Script

Names_Aliases Dataset

Note on Using Arrays

JOIN Query Script

Tables Used

Employees Table

Cards Table

Employee Cards (Joined) Table

Related Questions