Reputation: 85
I am new to Regular Expression and I have kind of a phone directory. I want to extract the names out of it. I wrote this (below), but it extracts lots of unwanted text rather than just names. Can you kindly tell me what am i doing wrong and how to correct it? Here is my code:
import re
directory = '''Mark Adamson
Home: 843-798-6698
(424) 345-7659
265-1864 ext. 4467
326-665-8657x2986
E-mail:[email protected]
Allison Andrews
Home: 612-321-0047
E-mail: [email protected]
Cellular: 612-393-0029
Dustin Andrews'''
nameRegex = re.compile('''
(
[A-Za-z]{2,25}
\s
([A-Za-z]{2,25})+
)
''',re.VERBOSE)
print(nameRegex.findall(directory))
the output it gives is:
[('Mark Adamson', 'Adamson'), ('net\nAllison', 'Allison'), ('Andrews\nHome', 'Home'), ('com\nCellular', 'Cellular'), ('Dustin Andrews', 'Andrews')]
Would be really grateful for help!
Upvotes: 4
Views: 1877
Reputation: 44043
Try:
nameRegex = re.compile('^((?:\w+\s*){2,})$', flags=re.MULTILINE)
This will only choose complete lines that are made up of two or more names composed of 'word' characters.
Upvotes: 0
Reputation: 5319
The following regex works as expected.
Related part of the code:
nameRegex = re.compile(r"^[a-zA-Z]+[',. -][a-zA-Z ]?[a-zA-Z]*$", re.MULTILINE)
print(nameRegex.findall(directory)
Output:
>>> python3 test.py
['Mark Adamson', 'Allison Andrews', 'Dustin Andrews']
Upvotes: 0
Reputation: 347
Your problem is that \s
will also match newlines. Instead of \s
just add a space. That is
name_regex = re.compile('[A-Za-z]{2,25} [A-Za-z]{2,25}')
This works if the names have exactly two words. If the names have more than two words (middle names or hyphenated last names) then you may want to expand this to something like:
name_regex = re.compile(r"^([A-Za-z \-]{2,25})+$", re.MULTILINE)
This looks for one or more words and will stretch from the beginning to end of a line (e.g. will not just get 'John Paul' from 'John Paul Jones')
Upvotes: 6
Reputation: 401
I can suggest to try the next regex, it works for me:
"([A-Z][a-z]+\s[A-Z][a-z]+)"
Upvotes: 0