flowfree

Reputation: 16462

Regex to match possible names from a string

I want to match possible names from a string. A name should be 2-4 words, each with 3 or more letters, all words capitalized. For example, given this list of strings:

Her name is Emily.
I work for Surya Soft.
I sent an email for Ery Wulandari.
Welcome to the Link Building Partner program!

I want a regex that returns:

None
Surya Soft
Ery Wulandari
Link Building Partner

Here is my current code:

import re

data = [
   'Her name is Emily.',
   'I work for Surya Soft.',
   'I sent an email for Ery Wulandari.',
   'Welcome to the Link Building Partner program!'
]

for line in data:
    print(re.findall(r'(?:[A-Z][a-z0-9]{2,}\s+[A-Z][a-z0-9]{2,})', line))

It works for the first three lines, but it fails on the last line: it returns only 'Link Building' instead of 'Link Building Partner'.

Upvotes: 2

Views: 110

Answers (4)

ashokadhikari

Reputation: 1200

You can wrap the single-word pattern in a group and repeat that group, as shown below:

import re

# One capitalized word plus optional whitespace, repeated two or more times.
compiled = re.compile(r'(?:(([A-Z][a-z0-9]{2,})\s*){2,})')
for line in data:
    match = compiled.search(line)
    if match:
        print(match.group())
    else:
        print(None)

Output:

None
Surya Soft
Ery Wulandari
Link Building Partner 
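
Because `\s*` also matches zero whitespace, the pattern above can treat a CamelCase token such as `JavaScript` as two words, and `{2,}` sets no upper limit on the number of words. The following is a minimal sketch of a stricter variant, not part of the original answer. It assumes that words are separated by at least one whitespace character and contain letters only. The names `STRICT_NAME` and `strict_names` are hypothetical.

```python
import re

# Assumption: a "word" is one capital letter followed by 2+ lowercase letters,
# and words are separated by at least one whitespace character.
# The first word plus 1 to 3 further words gives runs of 2 to 4 words.
STRICT_NAME = re.compile(r'\b[A-Z][a-z]{2,}(?:\s+[A-Z][a-z]{2,}){1,3}\b')

def strict_names(line):
    # The pattern has no capturing groups, so findall returns whole matches.
    return STRICT_NAME.findall(line)

data = [
    'Her name is Emily.',
    'I work for Surya Soft.',
    'I sent an email for Ery Wulandari.',
    'Welcome to the Link Building Partner program!',
]

for line in data:
    print(strict_names(line) or None)
```

On the sample data this prints None, ['Surya Soft'], ['Ery Wulandari'] and ['Link Building Partner'], with no trailing whitespace to trim.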

Upvotes: 2

Ashwini Chaudhary

Reputation: 250951

Non-regex solution:

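from string import punctuation as punc

def solve(strs):
   # Each inner list holds (index, word) pairs for consecutive capitalized words.
   words = [[]]
   for i, x in enumerate(strs.split()):
      x = x.strip(punc)
      # Check the length first so a token made only of punctuation cannot raise IndexError.
      if len(x) > 2 and x[0].isupper():
         if words[-1] and words[-1][-1][0] == i - 1:
            words[-1].append((i, x))
         else:
            words.append([(i, x)])

   # Keep only runs of 2 to 4 words.
   names = [" ".join(y[1] for y in x) for x in words if 2 <= len(x) <= 4]
   return ", ".join(names) if names else None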


data = [
   'Her name is Emily.', 
   'I work for Surya Soft.', 
   'I sent an email for Ery Wulandari.', 
   'Welcome to the Link Building Partner abc Fooo Foo program!'
]
for x in data:
   print(solve(x))

Output:

None
Surya Soft
Ery Wulandari
Link Building Partner, Fooo Foo

Upvotes: 1

Adobe

Reputation: 13477

# Returns each run of word characters that starts with a capital letter, one item per word.
for line in data:
    print(re.findall(r"[A-Z]\w+", line))

Upvotes: 0

JBernardo

Reputation: 33397

You can use:

re.findall(r'((?:[A-Z]\w{2,}\s*){2,4})', line)

A match may include trailing whitespace, which can be removed with .strip().
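
As a rough illustration of the trimming step, the sketch below wraps the pattern in a small helper and strips each match. The helper name `find_names` and the constant `NAME_RE` are hypothetical and not part of the answer.

```python
import re

# Same pattern as in the answer: 2 to 4 repetitions of a capitalized word
# (3+ word characters) followed by optional whitespace.
NAME_RE = re.compile(r'((?:[A-Z]\w{2,}\s*){2,4})')

def find_names(line):
    # The optional whitespace after the last word may be captured, so strip it.
    return [m.strip() for m in NAME_RE.findall(line)]

print(find_names('Welcome to the Link Building Partner program!'))
# prints ['Link Building Partner']
```

For a line with no run of two or more capitalized words, such as 'Her name is Emily.', the helper returns an empty list rather than None.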

Upvotes: 2
