Trance339

Reputation: 307

Parsing names with Perl regex

I'm trying to write a regex that can parse a full name and split it into first name, middle name, and last name. This should be easy, but it's pretty hard once you see the kinds of names I have to parse. I could write one big, long regex that takes all these different cases into account, but I think a smaller, dynamic regex is possible, and that's why I am here asking for some help.

I think these are all of the types of names I have to grab.

Some example names that need to be parsed (each has three commas at the end):

(first name) (middle initial). (last name),,, //one middle initial with period after
(first name) (last name),,,                  //simple first and last
(No name),,,                                 //no name
(first name) (last name)-(last name),,,      //two last names separated by a dash
(first name) (middle initial). (middle initial). (last name),,,   //two middle initials with a space in between
(first name) (last name w/ apostrophe),,,    //Last names with apostrophes 
(first name) (Middle name) (Last name),,,    //first middle and last name

Upvotes: 2

Views: 2458

Answers (3)

daxim

Reputation: 39158

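use 5.010;
use DDS;    # short alias for Data::Dump::Streamer
for (<DATA>) {
    chomp;
    s/,,,.*//;    # drop the trailing commas and anything after them
    if (/\A\s*\z/) {
        # nothing but whitespace is left: there is no name on this line
        say 'no name';
    } elsif (/\A (?<first>\S+) \s+ (?<middle>.*?)? (?:\s+)? (?<last>\S+) \z/msx) {
        # dump the named captures (first, middle, last) from %+
        DumpLex \%+;
    }
}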

__DATA__
Foo B. Baz,,,
Fnord Quux,,,
 ,,,
Xyzzy Bling-Bling,,,
Abe C. D. Efg,,,
Ed O'postrophe,,,
First Middle Last,,,

$HASH1 = {
           first  => 'Foo',
           last   => 'Baz',
           middle => 'B.'
         };
$HASH1 = {
           first  => 'Fnord',
           last   => 'Quux',
           middle => ''
         };
no name
$HASH1 = {
           first  => 'Xyzzy',
           last   => 'Bling-Bling',
           middle => ''
         };
$HASH1 = {
           first  => 'Abe',
           last   => 'Efg',
           middle => 'C. D.'
         };
$HASH1 = {
           first  => 'Ed',
           last   => 'O\'postrophe',
           middle => ''
         };
$HASH1 = {
           first  => 'First',
           last   => 'Last',
           middle => 'Middle'
         };

Upvotes: 3

Ether

Reputation: 53976

You can't parse something that ultimately follows no rules and hope to have any success. The problem is not translating the algorithm to a regular expression, but writing the algorithm to begin with.

Consider: how would you write an algorithm that could properly parse all these names into Given, Middle, and Family names?

  • Bob Mac Intosh
  • Mary Jane Watson
  • Thurston Powell III
  • Michael van der Velden
  • Jacqueline Kennedy Onassis
  • Dr. Jean Grey
  • Takahashi Shiro
  • Michel La Fontaine
  • Sir Alec Guinness
  • Mary-Sue Bowes-Lyon
  • Sacha Baron Cohen
  • Jack Arnold Jr.

See what I mean? You'd need an AI to be able to properly chunk each of these words into the proper context. Some people use two names as their "given" name. Some people use titles or honorifics, and some cultures place their family name first and given name last.

Summary: Don't do it. If you cannot get the user to separate their name into specific chunks for you, you must treat them as atoms.
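
As a rough illustration of the point, here is a minimal sketch that runs a first/middle/last pattern of the kind posted in this thread against a few of the names listed above. The helper name naive_parse is hypothetical, and the sketch only shows where whitespace-based splitting puts each word; it does not claim to know what the right answer is for every name.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original posts): applies a simple
# first/middle/last pattern and returns the named captures as a hash ref,
# or nothing if the name does not match at all.
sub naive_parse {
    my ($name) = @_;
    return unless $name =~ /\A (?<first>\S+) \s+ (?<middle>.*?)? (?:\s+)? (?<last>\S+) \z/x;
    my %captures = %+;    # copy the captures before another match overwrites them
    return \%captures;
}

for my $name ('Thurston Powell III', 'Michael van der Velden', 'Dr. Jean Grey') {
    my $p = naive_parse($name) or next;
    printf "%-24s first=[%s] middle=[%s] last=[%s]\n",
        $name, $p->{first}, $p->{middle} // '', $p->{last};
}
```

The suffix "III" lands in the last-name slot, "van der" is treated as a middle name, and the title "Dr." becomes the first name. The pattern has no way to tell these apart from ordinary name words.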

Upvotes: 4

Ian Stuart

Reputation: 54

No code, but try:

  1. use substr to remove the last three characters off $name,
  2. @array = split /[\s.]+/, $name; # split on whitespace and/or dots (as mentioned above) into an array,
  3. if ($array[0]) then you have a name,
  4. $lastname = pop @array; # gets the last (or only) name
  5. $firstname = shift @array if scalar @array; # first name is first element
  6. @array now contains all middle names and/or initials

Something like that, anyway...
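
The steps above could be sketched roughly as follows. The sub name split_name and the shape of the returned hash are assumptions made for illustration, and the sketch assumes, as the steps do, that the last three characters of each line are always the trailing commas. Note that splitting on dots drops the period from initials.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the numbered steps; the name and the
# returned hash layout are assumptions, not part of the original answer.
sub split_name {
    my ($name) = @_;

    # Step 1: drop the last three characters (assumed to be ",,,").
    $name = substr($name, 0, -3) if length($name) >= 3;

    # Step 2: split on runs of whitespace and/or dots; discard empty fields.
    my @parts = grep { length } split /[\s.]+/, $name;

    # Step 3: nothing left means there was no name.
    return unless @parts;

    # Step 4: the last element is the last (or only) name.
    my $last = pop @parts;

    # Step 5: the first remaining element, if any, is the first name.
    my $first = @parts ? shift @parts : '';

    # Step 6: whatever remains is the list of middle names and initials.
    return { first => $first, middle => [@parts], last => $last };
}

for my $line ('Foo B. Baz,,,', 'Fnord Quux,,,', ' ,,,', 'Abe C. D. Efg,,,') {
    my $n = split_name($line);
    if ($n) { print join(' | ', $n->{first}, "@{ $n->{middle} }", $n->{last}), "\n" }
    else    { print "no name\n" }
}
```

Returning the middle parts as an array reference keeps multiple initials separate, which leaves the caller free to join or reformat them.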

Upvotes: 3
