Jim McMullen
Jim McMullen

Reputation: 45

Regex for parsing names where last name has a prefix

I am learning regular expressions (c#) working in RegexBuddy (love it). I have been to trying to parse names with a very specific pattern. I know it cannot be made perfect, but I think am very close to what I want to accomplish.

Assumptions:

  1. the name pattern is FIRST [MIDDLE] LAST, all caps, where MIDDLE is optional and there is NO title or suffix
  2. I want to capture FIRST and MIDDLE to into a firstname value, and LAST into a lastname value
  3. FIRST and MIDDLE together may have any number of words
  4. I know that I cannot match multiple-word last names (which I am okay with) EXCEPT in 2 cases:
    • hyphenated last names
    • names in which a last name has a prefix ("EL GHAMRY SABE", "DE AMORIM SILVA", "DE LA HOYA" are actual examples from my data)

Here is my regex so far (using a few of the last-name prefixes):

^(?<first>[ A-Z]+?) (?<last>(?<pfx>(?:(?:EL|DE|LA) )*)[A-Z\-]+?)$

Which works well (capturing first, last and last-name-prefix) with:

JOHN SMITH
JOHN JAY SMITH
JOHN JAYEL SMITH
JOHN JAY SMITH-JONES
JOHN JAY JIMMY SMITH JONES  -- only "JONES" is in the last name, which is okay for this exercise
JOHN JAY EL AMIN
JOHN JAY DE LA HOYA  -- "DE LA HOYA" is the last name
JOHN JAY EL  -- a case where "EL" is actually the last name
JOHN EL AMIN

But fails on these two which have multi-part last names following the last name prefix (only the last word is captured in the lastname field):

JOHN JAY EL GHAMRY SABE
CICERO JOSE TORRES DE AMORIM SILVA

SO... 2 questions:

  1. How do I alter my expression so that IF there is a last name prefix that everything including and after the prefix ("EL","DE","LE", "DE LA", etc.) are included in the lastname field, and IF there is NO prefix, only the last word is included in the lastname field?
  2. As I am still learning, can you suggest other improvements to my regex?

Upvotes: 2

Views: 1898

Answers (2)

Cu3PO42
Cu3PO42

Reputation: 1473

Original answer

You can use negative lookahead to assert that the next 'middle name' you are matching is not one of the prefixes for the last name:

^(?<first>[A-Z]+(?:\s+(?!(?:EL|DE|LA)\s)[A-Z]+)*)\s+(?<last>[A-Z]+(?:(?:-|\s+)[A-Z]+)*)$

Also note that I changed the explicit use of spaces to \s, which is considered good practice, because most RegEx engines can be set to ignore whitespace completely to be able to format the expressions more nicely. In my RegEx I sticked to the syntax used by your program, even though the use of (?P<group_name>) is probably more common.

Reason for the bad performance

I just saw that while fixing the mistakes I made, you introduced enormous overhead in the execution of the RegEx. Originally my expression worked as you intended, but it included the space between the last middle name and the last name in the group of the first name, in an attempt to fix this I slightly changed the expression, which fixed the space problem, but I did not rerun all the tests and the name association went by unnoticed by me. And I honestly just read over the last name separated by hyphens part - even though this is obviously extremely easy to fix ;)
Back to my main point: While you made the RegEx function as intended your version uses lookahead after every character of the first name - even though it is only necessary after every space! And because lookahead is an extremely costly operation this makes the RegEx a lot slower.

How to improve the performance?

I fixed the RegEx myself - in a way that preserves the low number of lookahead operations - and while I was at it also improved the last name part, because I figured hyphens should only occur between letters and not anywhere, i.e. not as the last character or between spaces. And space should obviously not be the last character, either.

Benchmarks

To prove that my RegEx (the original and my fixed one at least) doesn't have bad performance at all I ran some benchmarks and unit tests. You can download the code here if you want to run the tests yourself to make sure I didn't cheat ;) The tests are written in Python, but the results should be similar in other RegEx engines.

Timing RegEx by Ron Rosenfeld (^(?P<First>(?:[-A-Z\s](?!\b(?:DE\sLA|EL|DE|LE)\b))+)\s+(?P<Last>\b[-A-Z\s]+)$):
=====================

 * Took 0.159s to run test case "JOHN SMITH" 100000 times.
 * Took 0.198s to run test case "JOHN JAY SMITH" 100000 times.
 * Took 0.209s to run test case "JOHN JAYEL SMITH" 100000 times.
 * Took 0.273s to run test case "JOHN JAY SMITH-JONES" 100000 times.
 * Took 0.274s to run test case "JOHN JAY JIMMY SMITH JONES" 100000 times.
 * Took 0.135s to run test case "JOHN JAY EL AMIN" 100000 times.
 * Took 0.143s to run test case "JOHN JAY DE LA HOYA" 100000 times.
 * Took 0.130s to run test case "JOHN JAY EL" 100000 times.
 * Took 0.109s to run test case "JOHN EL AMIN" 100000 times.
 * Took 0.146s to run test case "JOHN JAY EL GHAMRY SABE" 100000 times.
 * Took 0.223s to run test case "CICERO JOSE TORRES DE AMORIM SILVA" 100000 times.
Took 2.001s to run all tests 100000 times.


Timing RegEx by Jim McMullen (^(?P<first>(?:(?!(?:EL|DE|LA)\s)[A-Z]+\s?)+)\s+(?P<last>[A-Z\-\s]+)$):
=====================

 * Took 0.634s to run test case "JOHN SMITH" 100000 times.
 * Took 0.649s to run test case "JOHN JAY SMITH" 100000 times.
 * Took 0.659s to run test case "JOHN JAYEL SMITH" 100000 times.
 * Took 0.793s to run test case "JOHN JAY SMITH-JONES" 100000 times.
 * Took 0.689s to run test case "JOHN JAY JIMMY SMITH JONES" 100000 times.
 * Took 0.118s to run test case "JOHN JAY EL AMIN" 100000 times.
 * Took 0.126s to run test case "JOHN JAY DE LA HOYA" 100000 times.
 * Took 0.168s to run test case "JOHN JAY EL" 100000 times.
 * Took 0.100s to run test case "JOHN EL AMIN" 100000 times.
 * Took 0.123s to run test case "JOHN JAY EL GHAMRY SABE" 100000 times.
 * Took 0.143s to run test case "CICERO JOSE TORRES DE AMORIM SILVA" 100000 times.
Took 4.201s to run all tests 100000 times.


Timing RegEx by Cu3PO42 (^(?P<first>[A-Z]+(?:\s+(?!(?:EL|DE|LA)\s)[A-Z]+)*)\s+(?P<last>[A-Z]+(?:(?:-|\s+)[A-Z]+)*)$):
=====================

 * Took 0.157s to run test case "JOHN SMITH" 100000 times.
 * Took 0.176s to run test case "JOHN JAY SMITH" 100000 times.
 * Took 0.178s to run test case "JOHN JAYEL SMITH" 100000 times.
 * Took 0.199s to run test case "JOHN JAY SMITH-JONES" 100000 times.
 * Took 0.229s to run test case "JOHN JAY JIMMY SMITH JONES" 100000 times.
 * Took 0.148s to run test case "JOHN JAY EL AMIN" 100000 times.
 * Took 0.172s to run test case "JOHN JAY DE LA HOYA" 100000 times.
 * Took 0.136s to run test case "JOHN JAY EL" 100000 times.
 * Took 0.112s to run test case "JOHN EL AMIN" 100000 times.
 * Took 0.175s to run test case "JOHN JAY EL GHAMRY SABE" 100000 times.
 * Took 0.200s to run test case "CICERO JOSE TORRES DE AMORIM SILVA" 100000 times.
Took 1.881s to run all tests 100000 times.

As you can see you introduced over 100%(!) overhead while fixing my RegEx. My tests reveal that my RegEx is actually the fastest overall and the fastest in most test cases while providing slightly more features (concerning the last name), which also consume processing time.

Upvotes: 1

Ron Rosenfeld
Ron Rosenfeld

Reputation: 60199

I would match all up to the prefix as the first name (using negative look ahead), and then match the rest of the line into the last name.

^(?<First>(?:[-A-Z\s](?!\b(?:DE\sLA|EL|DE|LE)\b))+)\s+(?<Last>\b[-A-Z\s]+)$

Upvotes: 1

Related Questions