superblowncolon
superblowncolon

Reputation: 185

Logic and query for getting first name from full name

I have a large database with a full name field. The full name can be in any format and can also include title. For example, all of the following are possible:

John Smith
Smith, John
Mr. John Smith
Dr. John Smith
Mrs. Jane Smith
Ms. Jane Smith
Jane Smith, Esq.
Jane Smith, MD

I want to preserve the full name field, but also add a predicted first name field from a separate table (that contains name, gender).

I think the proper logic for this is to match the first name values + a space to the full name table via LIKE. The space is so that "David Johnson" doesn't match to "John."

I think the way to accomplish this is an update statement with a subquery in it. Here's what I have so far:

UPDATE "employees" 
    SET "employees".FirstName = (SELECT  firstname 
                                 FROM genders 
                                 WHERE fullname LIKE '%"employees".FirstName %')

Upvotes: 0

Views: 329

Answers (1)

Zack
Zack

Reputation: 2341

What you really want to do is use Postgres's full text search capabilities. You can create a stopwords list containing titles to exclude (Mr, Ms, etc.). Then, set up a search configuration to use your stopwords.

Once you've set up your search configuration correctly, your query will look something like this (this is the SELECT variant: Changing to UPDATE will be trivial):

SELECT employees.full_name, genders.first_name 
FROM employees
    LEFT JOIN genders ON
        TO_TSVECTOR('english_titles', employees.full_name) 
            @@ TO_TSQUERY('english_titles', genders.first_name)

This will give you the following results:

full_name           first_name          
"John Smith"        "John"
"Smith, John"       "John"
"Mr. John Smith"    "John"
"Dr. John Smith"    "John"
"Mrs. Jane Smith"   "Jane"
"Ms. Jane Smith"    "Jane"
"Jane Smith, Esq."  "Jane"
"Jane Smith, MD"    "Jane"
"David Johnson"     NULL

In order for this to work, you'll need to take the following steps:

  1. Create a stopwords file containing job titles, and put it in your $SHAREDIR/tsearch_data Postgres directory. See https://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS.
  2. Create a dictionary that uses this stopwords list (you can probably use the pg_catalog.simple as your template dictionary). See https://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY.
  3. Create a search configuration for job titles. See https://www.postgresql.org/docs/9.1/static/textsearch-configuration.html.
  4. Alter your search configuration to use the dictionary you created in Step 2 (cf. the link above).

Now, with all that said, you need to think carefully about a couple of things:

  • How do you expect to handle people whose last name matches a first name in your Genders table? For example, you have someone called John Stuart, and both John and Stuart are in your genders table. How do you expect to handle that?
  • How do you expect to handle people with nicknames, or only have one name? I would strongly encourage you to read Falsehoods Programmers Believe About Names to make sure you're not making any ill-founded assumptions.
  • Why is your table containing first named called genders? Do you expect to match people to first-name by gender? If so, that's a dangerous road to go down---there are names that can be used for either sex.

Upvotes: 3

Related Questions