Regex for parsing names where last name has a prefix

Question

I am learning regular expressions (C#), working in RegexBuddy (love it). I have been trying to parse names with a very specific pattern. I know it cannot be made perfect, but I think I am very close to what I want to accomplish.

Assumptions:

the name pattern is FIRST [MIDDLE] LAST, all caps, where MIDDLE is optional and there is NO title or suffix
I want to capture FIRST and MIDDLE into a firstname value, and LAST into a lastname value
FIRST and MIDDLE together may have any number of words
I know that I cannot match multiple-word last names (which I am okay with) EXCEPT in 2 cases:
- hyphenated last names
- names in which a last name has a prefix ("EL GHAMRY SABE", "DE AMORIM SILVA", "DE LA HOYA" are actual examples from my data)

Here is my regex so far (using a few of the last-name prefixes):

^(?<firstname>[ A-Z]+?) (?<lastname>(?<prefix>(?:(?:EL|DE|LA) )*)[A-Z\-]+?)$

Which works well (capturing first, last and last-name-prefix) with:

JOHN SMITH
JOHN JAY SMITH
JOHN JAYEL SMITH
JOHN JAY SMITH-JONES
JOHN JAY JIMMY SMITH JONES  -- only "JONES" is in the last name, which is okay for this exercise
JOHN JAY EL AMIN
JOHN JAY DE LA HOYA  -- "DE LA HOYA" is the last name
JOHN JAY EL  -- a case where "EL" is actually the last name
JOHN EL AMIN

But fails on these two which have multi-part last names following the last name prefix (only the last word is captured in the lastname field):

JOHN JAY EL GHAMRY SABE
CICERO JOSE TORRES DE AMORIM SILVA

SO... 2 questions:

How do I alter my expression so that IF there is a last-name prefix ("EL", "DE", "LE", "DE LA", etc.), everything from the prefix onward is included in the lastname field, and IF there is NO prefix, only the last word is included in the lastname field?
As I am still learning, can you suggest other improvements to my regex?

Ron Rosenfeld · Accepted Answer

I would match everything up to the prefix as the first name (using a negative lookahead), and then match the rest of the line as the last name.

^(?<firstname>(?:[-A-Z\s](?!\b(?:DE\sLA|EL|DE|LE)\b))+)\s+(?<lastname>\b[-A-Z\s]+)$
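
To try the idea in C#, a minimal sketch might look like the following. The group names `firstname` and `lastname`, the `NameSplitter` class, and its `Split` helper are assumptions made for illustration, and the prefix list is only the short sample used above, so it would need extending for real data.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameSplitter
{
    // Each first-name character must NOT be followed by a whole-word prefix,
    // so the first-name group cannot run past the space before "EL", "DE", etc.
    // The prefix alternation is a sample list (assumption), not a complete one.
    private static readonly Regex NamePattern = new Regex(
        @"^(?<firstname>(?:[-A-Z\s](?!\b(?:DE\sLA|EL|DE|LE)\b))+)\s+(?<lastname>\b[-A-Z\s]+)$",
        RegexOptions.Compiled);

    // Hypothetical helper: returns null when the input does not match the pattern.
    public static (string First, string Last)? Split(string fullName)
    {
        Match m = NamePattern.Match(fullName);
        if (!m.Success)
        {
            return null;
        }
        return (m.Groups["firstname"].Value, m.Groups["lastname"].Value);
    }

    public static void Main()
    {
        string[] samples =
        {
            "JOHN SMITH",
            "JOHN JAYEL SMITH",
            "JOHN JAY SMITH-JONES",
            "JOHN JAY DE LA HOYA",
            "JOHN JAY EL GHAMRY SABE",
            "CICERO JOSE TORRES DE AMORIM SILVA",
        };

        foreach (string s in samples)
        {
            var parts = Split(s);
            Console.WriteLine(parts is null
                ? $"{s} -> no match"
                : $"{s} -> first='{parts.Value.First}', last='{parts.Value.Last}'");
        }
    }
}
```

With this pattern, "JOHN JAY EL GHAMRY SABE" should split into first name "JOHN JAY" and last name "EL GHAMRY SABE", while "JOHN JAYEL SMITH" still yields "SMITH" as the last name, because the `\b` inside the lookahead only accepts a prefix that stands as a whole word.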
