Reputation: 39
I'm trying to extract every first name and last name pair (e.g. John Johnson) from a large text (about 20 pages).
I split the text using \. as the separator, and here is my regular expression:
\b([A-Z]{1}[a-z]+\s{1})([A-Z]{1}[a-z]+)\b
Unfortunately, I get back whole lines of my text instead of just the first and last names:
Suddenly, Mary Poppins flew away with her umbrella
Later in the day, John.... bla bla bla
Could someone help me?
Upvotes: 3
Views: 10533
Reputation: 7521
I've adapted a regular expression that handles accented letters and hyphens in compound names:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

# Raw string keeps the backslashes intact; re.UNICODE lets \w match accented letters.
r = re.compile(r'([A-Z]\w+(?=[\s\-][A-Z])(?:[\s\-][A-Z]\w+)+)',
               re.UNICODE)

# Maps each expected name to a sentence that contains it.
tests = {
    u'Jean Vincent Placé': u'Jean Vincent Placé est un excellent donneur de leçons',
    u'Giovanni Delle Bande Nere': u'In quest\'anno Giovanni Delle Bande Nere ha avuto tre momenti di gloria',
    # Here 'BDFL' may be unwanted
    u'BDFL Guido Van Rossum': u'Nobody hacks Python like BDFL Guido Van Rossum because he created it'
}

for expected, s in tests.items():
    match = r.search(s)
    assert match is not None
    extracted = match.group(0)
    print(expected)
    print(extracted)
    assert expected == extracted
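To pull every name out of a longer text rather than only the first one, the same pattern can be passed to findall. The following is only a sketch under Python 3: the helper name extract_names and the sample sentence are made up for illustration. Like the 'BDFL' case, any capitalised word sitting directly before a name would be captured along with it.

```python
import re

# Same pattern as in the answer, written as a raw string. In Python 3,
# \w already matches accented letters in str patterns.
NAME_RE = re.compile(r'([A-Z]\w+(?=[\s\-][A-Z])(?:[\s\-][A-Z]\w+)+)')


def extract_names(text):
    """Return every run of two or more capitalised words found in text.

    extract_names is a hypothetical helper, not part of the original answer.
    """
    # The pattern has a single capturing group around the whole name,
    # so findall returns a flat list of strings.
    return NAME_RE.findall(text)


# Made-up sample with a hyphenated first name and accented letters.
sample = "Hier, Jean-Pierre Dupont a rencontré Mary Poppins à Paris."
names = extract_names(sample)
print(names)
```

A single capitalised word such as "Paris" is skipped because the lookahead requires a second capitalised word right after it.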
Upvotes: 1
Reputation: 520
Try this:
import re
# Raw string, so that \b is a word boundary and not a backspace character
regex = re.compile(r"\b([A-Z][a-z]+) ([A-Z][a-z]+)\b")
string = """Suddenly, Mary Poppins flew away with her umbrella
Later in the day, John Johnson did something."""
regex.findall(string)
The output I got was:
[(u'Mary', u'Poppins'), (u'John', u'Johnson')]
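If the goal is a flat list of "First Last" strings rather than tuples, the pairs can be joined after matching. This is a minimal Python 3 sketch: the helper name full_names is hypothetical, and the pattern is assumed to be written as a raw string. It also shows why the raw string matters: in an ordinary string literal, "\b" is a backspace character, so the pattern finds nothing.

```python
import re

# Raw string: \b reaches the regex engine as a word-boundary assertion.
NAME_RE = re.compile(r"\b([A-Z][a-z]+) ([A-Z][a-z]+)\b")

# Non-raw string: Python turns "\b" into a backspace character (\x08),
# so this pattern only matches text that contains literal backspaces.
BROKEN_RE = re.compile("\b([A-Z][a-z]+) ([A-Z][a-z]+)\b")


def full_names(text):
    """Return each 'First Last' pair in text as one string.

    full_names is a hypothetical helper used only for this illustration.
    """
    return [" ".join(pair) for pair in NAME_RE.findall(text)]


text = """Suddenly, Mary Poppins flew away with her umbrella
Later in the day, John Johnson did something."""

names = full_names(text)
broken = BROKEN_RE.findall(text)
print(names)
print(broken)
```

Running findall on the whole text this way also avoids splitting on "." first, since the pattern already ignores everything that is not a capitalised pair.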
Upvotes: 2