Reputation: 45
I am learning regular expressions (c#) working in RegexBuddy (love it). I have been to trying to parse names with a very specific pattern. I know it cannot be made perfect, but I think am very close to what I want to accomplish.
Assumptions:
Here is my regex so far (using a few of the last-name prefixes):
^(?<first>[ A-Z]+?) (?<last>(?<pfx>(?:(?:EL|DE|LA) )*)[A-Z\-]+?)$
Which works well (capturing first, last and last-name-prefix) with:
JOHN SMITH
JOHN JAY SMITH
JOHN JAYEL SMITH
JOHN JAY SMITH-JONES
JOHN JAY JIMMY SMITH JONES -- only "JONES" is in the last name, which is okay for this exercise
JOHN JAY EL AMIN
JOHN JAY DE LA HOYA -- "DE LA HOYA" is the last name
JOHN JAY EL -- a case where "EL" is actually the last name
JOHN EL AMIN
But fails on these two which have multi-part last names following the last name prefix (only the last word is captured in the lastname field):
JOHN JAY EL GHAMRY SABE
CICERO JOSE TORRES DE AMORIM SILVA
SO... 2 questions:
Upvotes: 2
Views: 1898
Reputation: 1473
You can use negative lookahead to assert that the next 'middle name' you are matching is not one of the prefixes for the last name:
^(?<first>[A-Z]+(?:\s+(?!(?:EL|DE|LA)\s)[A-Z]+)*)\s+(?<last>[A-Z]+(?:(?:-|\s+)[A-Z]+)*)$
Also note that I changed the explicit use of spaces to \s
, which is considered good practice, because most RegEx engines can be set to ignore whitespace completely to be able to format the expressions more nicely. In my RegEx I sticked to the syntax used by your program, even though the use of (?P<group_name>)
is probably more common.
I just saw that while fixing the mistakes I made, you introduced enormous overhead in the execution of the RegEx. Originally my expression worked as you intended, but it included the space between the last middle name and the last name in the group of the first name, in an attempt to fix this I slightly changed the expression, which fixed the space problem, but I did not rerun all the tests and the name association went by unnoticed by me. And I honestly just read over the last name separated by hyphens part - even though this is obviously extremely easy to fix ;)
Back to my main point: While you made the RegEx function as intended your version uses lookahead after every character of the first name - even though it is only necessary after every space! And because lookahead is an extremely costly operation this makes the RegEx a lot slower.
I fixed the RegEx myself - in a way that preserves the low number of lookahead operations - and while I was at it also improved the last name part, because I figured hyphens should only occur between letters and not anywhere, i.e. not as the last character or between spaces. And space should obviously not be the last character, either.
To prove that my RegEx (the original and my fixed one at least) doesn't have bad performance at all I ran some benchmarks and unit tests. You can download the code here if you want to run the tests yourself to make sure I didn't cheat ;) The tests are written in Python, but the results should be similar in other RegEx engines.
Timing RegEx by Ron Rosenfeld (^(?P<First>(?:[-A-Z\s](?!\b(?:DE\sLA|EL|DE|LE)\b))+)\s+(?P<Last>\b[-A-Z\s]+)$):
=====================
* Took 0.159s to run test case "JOHN SMITH" 100000 times.
* Took 0.198s to run test case "JOHN JAY SMITH" 100000 times.
* Took 0.209s to run test case "JOHN JAYEL SMITH" 100000 times.
* Took 0.273s to run test case "JOHN JAY SMITH-JONES" 100000 times.
* Took 0.274s to run test case "JOHN JAY JIMMY SMITH JONES" 100000 times.
* Took 0.135s to run test case "JOHN JAY EL AMIN" 100000 times.
* Took 0.143s to run test case "JOHN JAY DE LA HOYA" 100000 times.
* Took 0.130s to run test case "JOHN JAY EL" 100000 times.
* Took 0.109s to run test case "JOHN EL AMIN" 100000 times.
* Took 0.146s to run test case "JOHN JAY EL GHAMRY SABE" 100000 times.
* Took 0.223s to run test case "CICERO JOSE TORRES DE AMORIM SILVA" 100000 times.
Took 2.001s to run all tests 100000 times.
Timing RegEx by Jim McMullen (^(?P<first>(?:(?!(?:EL|DE|LA)\s)[A-Z]+\s?)+)\s+(?P<last>[A-Z\-\s]+)$):
=====================
* Took 0.634s to run test case "JOHN SMITH" 100000 times.
* Took 0.649s to run test case "JOHN JAY SMITH" 100000 times.
* Took 0.659s to run test case "JOHN JAYEL SMITH" 100000 times.
* Took 0.793s to run test case "JOHN JAY SMITH-JONES" 100000 times.
* Took 0.689s to run test case "JOHN JAY JIMMY SMITH JONES" 100000 times.
* Took 0.118s to run test case "JOHN JAY EL AMIN" 100000 times.
* Took 0.126s to run test case "JOHN JAY DE LA HOYA" 100000 times.
* Took 0.168s to run test case "JOHN JAY EL" 100000 times.
* Took 0.100s to run test case "JOHN EL AMIN" 100000 times.
* Took 0.123s to run test case "JOHN JAY EL GHAMRY SABE" 100000 times.
* Took 0.143s to run test case "CICERO JOSE TORRES DE AMORIM SILVA" 100000 times.
Took 4.201s to run all tests 100000 times.
Timing RegEx by Cu3PO42 (^(?P<first>[A-Z]+(?:\s+(?!(?:EL|DE|LA)\s)[A-Z]+)*)\s+(?P<last>[A-Z]+(?:(?:-|\s+)[A-Z]+)*)$):
=====================
* Took 0.157s to run test case "JOHN SMITH" 100000 times.
* Took 0.176s to run test case "JOHN JAY SMITH" 100000 times.
* Took 0.178s to run test case "JOHN JAYEL SMITH" 100000 times.
* Took 0.199s to run test case "JOHN JAY SMITH-JONES" 100000 times.
* Took 0.229s to run test case "JOHN JAY JIMMY SMITH JONES" 100000 times.
* Took 0.148s to run test case "JOHN JAY EL AMIN" 100000 times.
* Took 0.172s to run test case "JOHN JAY DE LA HOYA" 100000 times.
* Took 0.136s to run test case "JOHN JAY EL" 100000 times.
* Took 0.112s to run test case "JOHN EL AMIN" 100000 times.
* Took 0.175s to run test case "JOHN JAY EL GHAMRY SABE" 100000 times.
* Took 0.200s to run test case "CICERO JOSE TORRES DE AMORIM SILVA" 100000 times.
Took 1.881s to run all tests 100000 times.
As you can see you introduced over 100%(!) overhead while fixing my RegEx. My tests reveal that my RegEx is actually the fastest overall and the fastest in most test cases while providing slightly more features (concerning the last name), which also consume processing time.
Upvotes: 1
Reputation: 60199
I would match all up to the prefix as the first name (using negative look ahead), and then match the rest of the line into the last name.
^(?<First>(?:[-A-Z\s](?!\b(?:DE\sLA|EL|DE|LE)\b))+)\s+(?<Last>\b[-A-Z\s]+)$
Upvotes: 1