Michael Sheaver
Michael Sheaver

Reputation: 2119

PostgreSQL: Match Common Name Variants (Nicknames)

Scenario

I have a number of enterprise datasets that I must find missing links between, and one of the ways I use for finding potential matches is joining on first and last name. The complication is that we have a significant number of people who use their legal name in one dataset (employee records), but they use either a nickname or (worse yet) their middle name in others (i.e., EAD, training, PIV card, etc.). I am looking for a way to match up these potentially disparate names across the various datasets.

Simplified Example

Here is an overly simplified example of what I am trying to do, but I think it conveys my thought process. I begin with the employee table:

Employees table

employee_id first_name last_name
052451 Robert Armsden
442896 Jacob Craxford
054149 Grant Keeting
025747 Gabrielle Renton
071238 Margaret Seifenmacher

and try to find the matching data from the PIV card dataset:

Cards table

card_id first_name last_name
1008571527 Bobbie Armsden
1009599982 Jake Craxford
1004786477 Gabi Renton
1000628540 Maggy Seifenmacher

Desired Result

After trying to match these datasets on first name and last name, I would like to end up with the following:

Employees_Cards table

emp_employee_id emp_first_name emp_last_name crd_card_id crd_first_name crd_last_name
052451 Robert Armsden 1008571527 Bobbie Armsden
442896 Jacob Craxford 1009599982 Jake Craxford
054149 Grant Keeting NULL NULL NULL
025747 Gabrielle Renton 1004786477 Gabi Renton
071238 Margaret Seifenmacher 1000628540 Maggy Seifenmacher

As you can see, I would like to make the following matches:

Gabrielle -> Gabi
Jacob -> Jacob
Margaret -> Maggy
Robert -> Bobbie

My initial thought was to find a common names dataset along the lines of:

Name_Aliases table

name1 name2 name3 name4
Gabrielle Gabi NULL NULL
Jacob Jake NULL NULL
Margaret Maggy Maggie Meg
Michael Mike Mikey Mick
Robert Bobbie Bob Rob

and use something like this for the JOIN:

CREATE TABLE employee_cards AS
    SELECT 
        employees.employee_id AS emp_employee_ID,
        employees.first_name AS emp_first_name,
        employees.last_name AS emp_last_name,
        cards.card_id AS crd_card_id,
        cards.first_name AS crd_first_name,
        cards.last_name AS crd_last_name
    FROM employees
    LEFT OUTER JOIN name_aliases
    LEFT OUTER JOIN cards
        ON employees.first_name IN (
            nane_aliases.name1,
            nane_aliases.name2,
            nane_aliases.name3,
            nane_aliases.name4
        )
        AND employees.last_name = cards.last_name;

This is where I got stuck, since I could not figure out how to tie the result of the first ON condition to the first names in the cards table.

Some Questions

As I ponder more deeply into this problem, I know that I am not the first person who has encountered this need for matching on common name variants. My initial search pointed me to things like fuzzysearch and soundex, but those are not quite what I need for the present scenario (although they may come in handy down the road). Given this, I have several questions for the community:

Downloadable Common Names Variants Dataset?

Has someone compiled or crowd-sourced a comprehensive names variants dataset along the lines of my name-aliases table above? My search landed me on a couple of sites that appear to have this kind of data, but neither one of them is downloadable for import into my local database.

I did find this SO discussion over a decade old, but it did not appear to be current: Database of common name aliases / nicknames of people

Also, I am unable to pay any money for this, so I hope there might be one hiding on github.

Better Way to Structure the Name_Aliases table?

Since each record in the name_aliases table can have two or more entries, is there a better way to set up the structure for this table, to make it infinitely flexible?

Match Any Column in Name_Aliases Table?

How do I set up the JOIN query to match up employees.first_name with any column in name_alises, and then finally match that with cards.first_name?

Better Solution?

Am I taking the wrong approach to solve this problem? Has someone come up with a more flexible and elegant approach to this using PostgreSQL?

Upvotes: 2

Views: 1113

Answers (3)

M. Kirk Jennings
M. Kirk Jennings

Reputation: 29

Don't use the two-column approach, use a given name and related alias table. This way you can use a soundex on the given name and the exact match on the alias table.

like this:

CREATE TABLE [Names].[Alias](
    [AliasId] [int] NOT NULL,
    [GivenNameId] [int] NOT NULL,
    [AliasName] [nvarchar](50) NOT NULL,
    [CreatedBy] [uniqueidentifier] NOT NULL,
    [CreatedDate] [datetime] NOT NULL
) ON [PRIMARY]
GO
/****** Object:  Table [Names].[GivenName]    Script Date: 2/2/2023 3:56:38 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [Names].[GivenName](
    [GivenNameId] [int] IDENTITY(1,1) NOT NULL,
    [GivenName] [nvarchar](50) NOT NULL,
    [CreatedBy] [uniqueidentifier] NOT NULL,
    [CreatedDate] [datetime] NOT NULL,
 CONSTRAINT [PK_GivenName] PRIMARY KEY CLUSTERED 
(
    [GivenNameId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [Names].[Alias] ADD  CONSTRAINT [DF_Alias_CreatedDate]  DEFAULT (getutcdate()) FOR [CreatedDate]
GO
ALTER TABLE [Names].[GivenName] ADD  CONSTRAINT [DF_GivenName_CreatedDate]  DEFAULT (getutcdate()) FOR [CreatedDate]
GO
ALTER TABLE [Names].[Alias]  WITH CHECK ADD  CONSTRAINT [FK_Alias_GivenName] FOREIGN KEY([GivenNameId])
REFERENCES [Names].[GivenName] ([GivenNameId])
GO
ALTER TABLE [Names].[Alias] CHECK CONSTRAINT [FK_Alias_GivenName]
GO

Upvotes: 0

Michael Sheaver
Michael Sheaver

Reputation: 2119

Solution

It was a bit of a battle, but I was able to get all of my questions answered and I was thrilled to finally get everything working on this project. Details are below.

Nicknames Dataset

I found a couple of promising nicknames datasets on Github, and this one looked like it is the most actively maintained: https://github.com/carltonnorthern/nickname-and-diminutive-names-lookup. I downloaded the names.csv file to my computer and imported it into my database with the following code:

Import Script

DROP TABLE IF EXISTS names_aliases_temp;
CREATE TABLE names_aliases_temp
(
   names_data text
);

COPY names_aliases_temp
FROM '~/Downloads/names.csv';

DROP TABLE IF EXISTS names_aliases;
CREATE TABLE names_aliases
(
   id serial,
   nicknames text[]
);

INSERT INTO names_aliases (nicknames)
SELECT string_to_array(names_data,',') FROM names_aliases_temp;

DROP TABLE IF EXISTS names_aliases_temp;

CREATE INDEX idx_gin_names ON names_aliases USING GIN(nicknames);

Names_Aliases Dataset

Here is a sample of what this looks like after importing into the database:

id nicknames
1 {aaron,erin,ronnie,ron}
2 {abbigail,nabby,abby,gail}
3 {abednego,bedney}
4 {abel,ebbie,ab,abe,eb}
5 {abiel,ab}
6 {abigail,nabby,abby,gail}
7 {abijah,ab,bige}
8 {abner,ab}
9 {abraham,ab,abe}
10 {abram,ab, abe}

Note on Using Arrays

I was thrilled at how easy it was to get into array format, but I was even more thrilled at how I was able to use the array format in the JOIN query! Since the number of entries per row vary so widely, I found that the array data type made it a natural fit for this data, and this also made it trivially easy to match records using the ILIKE ANY()operator.

JOIN Query Script

DROP TABLE IF EXISTS employee_cards;

CREATE TABLE employee_cards AS
WITH joined_data AS (
    SELECT
        employees.employee_id AS emp_id,
        employees.first_name  AS emp_first_name,
        employees.last_name   AS emp_last_name,
        cards.first_name      AS crd_first_name,
        cards.last_name       AS crd_last_name,
        cards.card_id         AS crd_id
    FROM employees
        -- Attempt to match first names with nicknames
        LEFT JOIN names_aliases
            ON employees.first_name ILIKE ANY(names_aliases.nicknames)
        LEFT JOIN cards
            -- First match records where name is the same between `employees` and `cards`
            ON (employees.last_name ILIKE cards.last_name
               AND employees.first_name ILIKE cards.first_name)
            -- Then bring in nicknames where no matches are found
            OR (employees.last_name ILIKE cards.last_name
                AND cards.first_name ILIKE ANY(names_aliases.nicknames))
)
-- Put successful matches at the top for each employee and retain only the first row
SELECT DISTINCT ON (emp_id)
    emp_id,
    emp_first_name,
    emp_last_name,
    crd_first_name,
    crd_last_name,
    crd_id
FROM joined_data
ORDER BY
    emp_id, 
    crd_id NULLS LAST;

Tables Used

For your convenience, the three tables used in this exercise are below.

Employees Table

employee_id first_name last_name
052451 Robert Armsden
022448 Michael Brown
442896 Jacob Craxford
054149 Grant Keeting
025747 Gabrielle Renton
425972 Consorcia Reyas
071238 Margaret Seifenmacher
insert into public.employees (employee_id, first_name, last_name)
values  ('052451', 'Robert', 'Armsden'),
        ('022448', 'Michael', 'Brown'),
        ('442896', 'Jacob', 'Craxford'),
        ('054149', 'Grant', 'Keeting'),
        ('025747', 'Gabrielle', 'Renton'),
        ('425972', 'Consorcia', 'Reyas'),
        ('071238', 'Margaret', 'Seifenmacher');

Cards Table

card_id first_name last_name
1008571527 Bob Armsden
1000594085 Michael Brown
1009599982 Jake Craxford
1004786477 Gabby Renton
1009481574 Consorcia Reyas
1000628540 Maggy Seifenmacher
insert into public.cards (card_id, first_name, last_name)
values  ('1008571527', 'Bob', 'Armsden'),
        ('1000594085', 'Michael', 'Brown'),
        ('1009599982', 'Jake', 'Craxford'),
        ('1004786477', 'Gabby', 'Renton'),
        ('1009481574', 'Consorcia', 'Reyas'),
        ('1000628540', 'Maggy', 'Seifenmacher');

Employee Cards (Joined) Table

emp_id emp_first_name emp_last_name crd_first_name crd_last_name crd_id
052451 Robert Armsden Bob Armsden 1008571527
022448 Michael Brown Michael Brown 1000594085
442896 Jacob Craxford Jake Craxford 1009599982
054149 Grant Keeting NULL NULL NULL
025747 Gabrielle Renton Gabby Renton 1004786477
425972 Consorcia Reyas Consorcia Reyas 1009481574
071238 Margaret Seifenmacher Maggy Seifenmacher 1000628540
insert into public.employee_cards (emp_id, emp_first_name, emp_last_name, crd_first_name, crd_last_name, crd_id)
values  ('052451', 'Robert', 'Armsden', 'Bob', 'Armsden', '1008571527'),
        ('022448', 'Michael', 'Brown', 'Michael', 'Brown', '1000594085'),
        ('442896', 'Jacob', 'Craxford', 'Jake', 'Craxford', '1009599982'),
        ('054149', 'Grant', 'Keeting', null, null, null),
        ('025747', 'Gabrielle', 'Renton', 'Gabby', 'Renton', '1004786477'),
        ('425972', 'Consorcia', 'Reyas', 'Consorcia', 'Reyas', '1009481574'),
        ('071238', 'Margaret', 'Seifenmacher', 'Maggy', 'Seifenmacher', '1000628540');

Upvotes: 2

Don R
Don R

Reputation: 623

How to structure and query and the aliases table is an interesting question. I'd suggest organizing it in pairs rather than wider rows, because you don't know in advance how many variations may eventually be needed in a group of connected names, and a two column structure gives you the ability to add to a given group indefinitely:

name1 name2
Jacob Jake
Margaret Maggy
Margaret Maggie
Margaret Meg
Maggy Maggie
Maggy Meg
Maggie Meg

Then you just check both columns in each JOIN in the query, something like this:

SELECT DISTINCT
    employees.employee_id AS emp_employee_id
    ,employees.first_name AS emp_first_name
    ,employees.last_name  AS emp_last_name
    ,cards.card_id        AS card_id
    ,cards.first_name     AS crd_first_name
    ,cards.last_name      AS crd_last_name
FROM
    employees
    INNER JOIN name_aliases ON
        name_aliases.name1    = employees.first_name
        OR name_aliases.name2 = employee.first_name
    LEFT JOIN cards ON
        (
            cards.first_name    = name_aliases.name1
            OR cards.first_name = name_aliases.name2
        )
        AND cards.last_name = employees.last_name

This assumes that some employees aren't in the cards table, and that no two people have the same-ish name. If those assumptions are correct you'll need the DISTINCT to eliminate duplicates on the partial outer join.

You could further simplify the query logic by pairing the names in a full outer join:

name1 name2
Jacob Jacob
Jacob Jake
Jake Jacob
Jake Jake
Margaret Margaret
Margaret Maggy
Maggy Margaret
Maggy Maggy
Meg Meg
Margaret Meg
Meg Margaret
Maggy Meg
Meg Maggy

and so on, and then you could treat the aliases table almost like a many-to-many join table:

SELECT DISTINCT
    employees.employee_id AS emp_employee_id
    ,employees.first_name AS emp_first_name
    ,employees.last_name  AS emp_last_name
    ,cards.card_id        AS card_id
    ,cards.first_name     AS crd_first_name
    ,cards.last_name      AS crd_last_name
FROM
    employees
    INNER JOIN name_aliases ON name_aliases.name1 = employees.first_name
    LEFT JOIN cards ON
        cards.first_name    = name_aliases.name2
        AND cards.last_name = employees.last_name

Upvotes: 0

Related Questions