Reputation: 9087
I only want to capture the capitalized words that are not in parentheses:
Reggie (Reginald) Potter -> Reggie Potter
I am using this regex:
test = re.findall(r'([A-Z][a-z]+(?:\s\(.*?\))?(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)', 'Reggie (Reginald) Potter')
I get this back:
Reggie (Reginald) Potter
I thought that, since this group is non-capturing:
(?:\s\(.*?\))
I wouldn't get back anything inside the parentheses.
Upvotes: 0
Views: 253
Reputation: 14458
I would use a simpler regex plus a list comprehension:
all_words = re.findall(r'(\(?\b[A-Z][a-z]+\b\)?)', 'Reggie (Reginald) Potter')
good_matches = [word for word in all_words if not (word.startswith('(') and word.endswith(')'))]
Now good_matches is ['Reggie', 'Potter'], as expected.
Upvotes: 0
Reputation: 33908
If the words you want to avoid are directly adjacent to parentheses, you could use negative look-behinds and look-aheads to match the ones that are not in parentheses:
(?<!\()\b([A-Z][a-z]+)\b(?!\))
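As a rough sketch of how that pattern could be used from Python (the function name capitalized_outside_parens and the second sample string are made up for illustration, not taken from the question):

```python
import re

# Negative lookbehind rejects a word right after "(",
# negative lookahead rejects a word right before ")".
PATTERN = re.compile(r'(?<!\()\b([A-Z][a-z]+)\b(?!\))')

def capitalized_outside_parens(text):
    """Return capitalized words that do not touch a parenthesis (hypothetical helper)."""
    return PATTERN.findall(text)

print(capitalized_outside_parens('Reggie (Reginald) Potter'))
# ['Reggie', 'Potter']

# Caveat: only words directly adjacent to a parenthesis are skipped,
# so a middle word inside longer parentheses still gets through.
print(capitalized_outside_parens('Bob (Mary Ann Jane) Lee'))
# ['Bob', 'Ann', 'Lee']
```

The second call shows the limitation mentioned above: if the parentheses can hold more than two words, stripping the parenthesized part first (or filtering afterwards) is probably safer.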
Upvotes: 2