Reputation: 16462
I want to match possible names from a string. A name should be 2-4 words, each with 3 or more letters, all words capitalized. For example, given this list of strings:
Her name is Emily.
I work for Surya Soft.
I sent an email for Ery Wulandari.
Welcome to the Link Building Partner program!
I want a regex that returns:
None
Surya Soft
Ery Wulandari
Link Building Partner
Here is my current code:
import re

data = [
    'Her name is Emily.',
    'I work for Surya Soft.',
    'I sent an email for Ery Wulandari.',
    'Welcome to the Link Building Partner program!'
]
for line in data:
    print(re.findall(r'(?:[A-Z][a-z0-9]{2,}\s+[A-Z][a-z0-9]{2,})', line))
It works for the first three lines, but it fails on the last line: it returns only "Link Building" instead of "Link Building Partner".
Upvotes: 2
Views: 110
Reputation: 1200
You can put one word in a group and apply a quantifier to it, so the repeating structure is written only once:
import re

# One capitalized word followed by optional whitespace, repeated two or more times.
compiled = re.compile(r'(?:(([A-Z][a-z0-9]{2,})\s*){2,})')
for line in data:
    match = compiled.search(line)
    if match:
        print(match.group())
    else:
        print(None)
Output:
None
Surya Soft
Ery Wulandari
Link Building Partner
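The question asks for 2-4 words, so here is a stricter variant as a rough sketch rather than a definitive pattern. It caps the repetition at four words and requires whitespace between them, which also means the match never ends in whitespace. The names NAME and first_name are only illustrative, and the character class [a-z0-9] is carried over from the question on the assumption that it fits the real data.

```python
import re

# Assumption: a name is one capitalized word plus 1 to 3 more,
# each separated by at least one whitespace character.
NAME = re.compile(r'[A-Z][a-z0-9]{2,}(?:\s+[A-Z][a-z0-9]{2,}){1,3}')

def first_name(line):
    """Return the first 2-4 word capitalized run in line, or None."""
    match = NAME.search(line)
    return match.group() if match else None

data = [
    'Her name is Emily.',
    'I work for Surya Soft.',
    'I sent an email for Ery Wulandari.',
    'Welcome to the Link Building Partner program!',
]
for line in data:
    print(first_name(line))
```

With a run of more than four capitalized words, this returns only the first four, which may or may not be what you want.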
Upvotes: 2
Reputation: 250951
Non-regex solution:
from string import punctuation as punc

def solve(strs):
    # Each entry is a run of (index, word) pairs for consecutive capitalized words.
    words = [[]]
    for i, x in enumerate(strs.split()):
        x = x.strip(punc)
        # Skip tokens that are empty after stripping punctuation.
        if x and x[0].isupper() and len(x) > 2:
            if words[-1] and words[-1][-1][0] == i - 1:
                words[-1].append((i, x))
            else:
                words.append([(i, x)])
    # Keep only runs of 2 to 4 words.
    names = [" ".join(y[1] for y in x) for x in words if 2 <= len(x) <= 4]
    return ", ".join(names) if names else None

data = [
    'Her name is Emily.',
    'I work for Surya Soft.',
    'I sent an email for Ery Wulandari.',
    'Welcome to the Link Building Partner abc Fooo Foo program!'
]
for x in data:
    print(solve(x))
Output:
None
Surya Soft
Ery Wulandari
Link Building Partner, Fooo Foo
Upvotes: 1
Reputation: 33397
You can use:
re.findall(r'((?:[A-Z]\w{2,}\s*){2,4})', line)
Each match may include trailing whitespace, which can be trimmed with .strip().
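To make that concrete, here is a minimal sketch that applies the pattern with findall and strips each result. The names PATTERN and find_names are just illustrative. Note that \w also matches uppercase letters, digits, and underscores, so this is looser than the [a-z0-9] class in the question.

```python
import re

# Same pattern as above: 2 to 4 capitalized words of 3+ characters each.
PATTERN = re.compile(r'((?:[A-Z]\w{2,}\s*){2,4})')

def find_names(line):
    """Return every candidate name in line, with trailing whitespace removed."""
    return [m.strip() for m in PATTERN.findall(line)]

print(find_names('Welcome to the Link Building Partner abc Fooo Foo program!'))
```

Unlike search, findall returns every non-overlapping match, so a line with no names gives an empty list rather than None.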
Upvotes: 2