MQaiser

Reputation: 131

Extract names from string with python Regex

I've been trying to extract names from a string, but don't seem to be close to success.

Here is the code:

import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
regex = re.compile(r'([A-Z][a-z]+(?: [A-Z][a-z]\.)? [A-Z][a-z]+)')
print(regex.findall(string))

This is the output I'm getting:

['Moe Szyslak', 'Timothy Lovejoy', 'Ned Flanders', 'Julius Hibbert']

Upvotes: 5

Views: 29934

Answers (4)

DYZ

Reputation: 57033

Extracting human names even in English is notoriously hard. The following regex solves your particular problem but may fail on other inputs (e.g., it does not capture names with dashes):

re.findall(r"[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+", string)
#['Moe Szyslak', 'Burns, C. Montgomery', 'Timothy Lovejoy', 
# 'Ned Flanders', 'Simpson, Homer', 'Julius Hibbert']
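
To make the caveat about dashes concrete, here is a minimal sketch of one possible extension. The `WORD` building block, the `HYPHEN_AWARE` pattern, and the sample input are made up for illustration and are not part of the answer above; other name shapes (apostrophes, lowercase particles, non-ASCII letters) would still be missed.

```python
import re

# Pattern from the answer above: it cannot step over a hyphen inside a name.
ORIGINAL = r"[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+"

# Hypothetical building block: a capitalized word, optionally followed by
# more capitalized words joined with hyphens (e.g. "Jean-Luc").
WORD = r"[A-Z][a-z]+(?:-[A-Z][a-z]+)*"
HYPHEN_AWARE = WORD + r",?\s+(?:[A-Z][a-z]*\.?\s*)?" + WORD

sample = "555-0199Jean-Luc Picard"  # made-up input, not from the question

original_result = re.findall(ORIGINAL, sample)
hyphen_result = re.findall(HYPHEN_AWARE, sample)

print(original_result)  # ['Luc Picard']
print(hyphen_result)    # ['Jean-Luc Picard']
```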

And with titles:

TITLE = r"(?:[A-Z][a-z]*\.\s*)?"
NAME1 = r"[A-Z][a-z]+,?\s+"
MIDDLE_I = r"(?:[A-Z][a-z]*\.?\s*)?"
NAME2 = r"[A-Z][a-z]+"

re.findall(TITLE + NAME1 + MIDDLE_I + NAME2, string)
#['Moe Szyslak', 'Burns, C. Montgomery', 'Rev. Timothy Lovejoy', 
# 'Ned Flanders', 'Simpson, Homer', 'Dr. Julius Hibbert']

As a side note, there is no need to compile a regex unless you plan to reuse it.
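
For the case where reuse does apply, a minimal sketch could look like the following. The names `NAME_RE` and `extract_names` and the list of input lines are assumptions made for the example, not something from the question.

```python
import re

# Pattern pieces copied from the answer above.
TITLE = r"(?:[A-Z][a-z]*\.\s*)?"
NAME1 = r"[A-Z][a-z]+,?\s+"
MIDDLE_I = r"(?:[A-Z][a-z]*\.?\s*)?"
NAME2 = r"[A-Z][a-z]+"

# Hypothetical name: compiled once because it is applied to several strings.
NAME_RE = re.compile(TITLE + NAME1 + MIDDLE_I + NAME2)


def extract_names(text):
    """Return every substring of text that matches NAME_RE."""
    return NAME_RE.findall(text)


# Made-up split of the question's data into separate records.
records = [
    "555-1239Moe Szyslak",
    "(636) 555-0113Burns, C. Montgomery",
    "555 -6542Rev. Timothy Lovejoy",
]

for record in records:
    print(extract_names(record))
```

For a single call, `re.findall(pattern, string)` is just as good, since the `re` module keeps its own cache of recently used patterns.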

Upvotes: 10

gregory

Reputation: 12895

Fancy regexes take time to compose and are difficult to maintain. In this case, I'd tend to keep it simple:

re.findall(r"[^()0-9-]+", string)

output:

['Moe Szyslak', ' ', 'Burns, C. Montgomery', ' ', 'Rev. Timothy Lovejoy', ' ', 'Ned Flanders', 'Simpson, Homer', 'Dr. Julius Hibbert']

If the blanks are an issue, I'd filter them out, for example with list(filter(str.strip, result)), where result is the list returned by findall.
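
Put together as a runnable sketch, this might look like the following. The variable names `chunks` and `names` are assumptions, and a list comprehension is used here instead of `filter` so that leading and trailing whitespace is also removed from the chunks that are kept.

```python
import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# Keep every run of characters that is not a digit, parenthesis, or hyphen.
chunks = re.findall(r"[^()0-9-]+", string)

# Strip surrounding whitespace and drop chunks that are empty afterwards.
names = [chunk.strip() for chunk in chunks if chunk.strip()]

print(names)
# ['Moe Szyslak', 'Burns, C. Montgomery', 'Rev. Timothy Lovejoy',
#  'Ned Flanders', 'Simpson, Homer', 'Dr. Julius Hibbert']
```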

Upvotes: 6

Lena

Reputation: 182

With spaCy you can extract entities such as person names with very little code. spaCy lets you rely on pretrained language models, which have learned a great deal about common names and titles.

  1. Step: set up spaCy and load the pretrained English language model (it has to be downloaded first)
    import spacy
    import en_core_web_sm
    nlp = en_core_web_sm.load()

  2. Step: create a spaCy document
    doc = nlp('555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert')

  3. Step: print all entities in the document that are labelled as a person
    print([(X.text, X.label_) for X in doc.ents if X.label_ == 'PERSON'])

Upvotes: -1

Tim Biegeleisen

Reputation: 521249

Here is one approach using zero-width lookarounds to isolate each name:

import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
result = re.findall(r'(?:(?<=^)|(?<=[^A-Za-z.,]))[A-Za-z.,]+(?: [A-Za-z.,]+)*(?:(?=[^A-Za-z.,])|(?=$))', string)

print(result)

['Moe Szyslak', 'Burns, C. Montgomery', 'Rev. Timothy Lovejoy', 'Ned Flanders',
 'Simpson, Homer', 'Dr. Julius Hibbert']

The actual pattern matched is this:

[A-Za-z.,]+(?: [A-Za-z.,]+)*

This says to match one or more uppercase or lowercase letters, dots, or commas, followed by zero or more repetitions of a single space and another run of one or more of the same characters.

In addition, we use the following lookarounds on the left and right of this pattern:

(?:(?<=^)|(?<=[^A-Za-z.,]))
Look behind and assert either the start of the string or a non-matching character.
(?:(?=[^A-Za-z.,])|(?=$))
Look ahead and assert either the end of the string or a non-matching character.
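
As a possible simplification, the two alternations can be written as negative lookarounds, because "not preceded (or followed) by a name character" is also true at the start (or end) of the string. The sketch below compares both forms on the question's input; the names `ORIGINAL` and `COMPACT` are assumptions made for the example.

```python
import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# Pattern from the answer above, with explicit start/end alternatives.
ORIGINAL = r'(?:(?<=^)|(?<=[^A-Za-z.,]))[A-Za-z.,]+(?: [A-Za-z.,]+)*(?:(?=[^A-Za-z.,])|(?=$))'

# Hypothetical compact form: negative lookbehind and negative lookahead.
COMPACT = r'(?<![A-Za-z.,])[A-Za-z.,]+(?: [A-Za-z.,]+)*(?![A-Za-z.,])'

original_result = re.findall(ORIGINAL, string)
compact_result = re.findall(COMPACT, string)

print(compact_result)
print(original_result == compact_result)  # True
```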

Upvotes: 1
