Reputation: 9087
I am having trouble with my regex for capturing consecutive capitalized words. Here is what I want the regex to capture:
"said Polly Pocket and the toys" -> Polly Pocket
Here is the regex I am using:
re.findall('said ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)', article)
It returns the following:
[('Polly Pocket', ' Pocket')]
I want it to return:
['Polly Pocket']
Upvotes: 12
Views: 11315
Reputation: 61
$mystring = "the United States of America has many big cities like New York and Los Angeles, and others like Atlanta";
@phrases = $mystring =~ /[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*/g;
print "\n" . join(", ", @phrases) . "\n\n# phrases = " . scalar(@phrases) . "\n\n";
OUTPUT:
$ ./try_me.pl
United States, America, New York, Los Angeles, Atlanta
# phrases = 5
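Since the question uses Python, here is a rough port of the Perl snippet above. It is only a sketch: the variable names mirror the Perl ones, and it assumes re.findall is an acceptable stand-in for Perl's list-context /g match.

```python
import re

mystring = ("the United States of America has many big cities like "
            "New York and Los Angeles, and others like Atlanta")

# The pattern has no capturing groups, so findall returns each
# whole match as a plain string.
phrases = re.findall(r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*", mystring)

print(", ".join(phrases))
print("# phrases =", len(phrases))
```

Note that the trailing * makes the extra words optional, so single capitalized words such as America and Atlanta match too. Change it to + if only runs of two or more words are wanted.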
Upvotes: 5
Reputation: 56915
It's because findall returns all the capturing groups in your regex, and you have two capturing groups (one that gets all the matching text, and the inner one for subsequent words).
You can just make your second capturing group into a non-capturing one by using (?:regex) instead of (regex):
re.findall(r'([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)', article)
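To see the difference side by side, here is a small sketch that runs both versions against the sentence from the question. The variable names are just for illustration, and the patterns keep the "said " prefix from the original regex.

```python
import re

article = "said Polly Pocket and the toys"

# Two capturing groups: findall returns one tuple per match,
# holding the text of each group.
with_inner_capture = re.findall(
    r'said ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)', article)

# Inner group made non-capturing: only one group is left,
# so findall returns plain strings.
without_inner_capture = re.findall(
    r'said ([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)', article)

print(with_inner_capture)     # [('Polly Pocket', ' Pocket')]
print(without_inner_capture)  # ['Polly Pocket']
```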
Upvotes: 7
Reputation: 101614
Use a positive look-ahead:
([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)
The look-ahead asserts that a word is accepted only if it is followed by another word that starts with a capital letter. Broken down:
(                # begin capture
  [A-Z]          # one uppercase letter     \ First Word
  [a-z]+         # 1+ lowercase letters     /
  (?=\s[A-Z])    # must have a whitespace character and an uppercase letter following it
  (?:            # non-capturing group
    \s           # whitespace character
    [A-Z]        # one uppercase letter     \ Additional Word(s)
    [a-z]+       # 1+ lowercase letters     /
  )+             # group can be repeated (more words)
)                # end capture
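In Python the breakdown above can be kept inside the pattern itself with re.VERBOSE, which ignores whitespace and # comments in the regex. This is a minimal sketch; the names pattern, question_text and cities_text are made up for the example.

```python
import re

pattern = re.compile(r"""
    (                 # begin capture
      [A-Z][a-z]+     # first word: one uppercase, 1+ lowercase letters
      (?=\s[A-Z])     # must be followed by whitespace and an uppercase letter
      (?:             # non-capturing group
        \s            # whitespace character
        [A-Z][a-z]+   # additional word
      )+              # group can be repeated (more words)
    )                 # end capture
""", re.VERBOSE)

question_text = "said Polly Pocket and the toys"
cities_text = ("the United States of America has many big cities like "
               "New York and Los Angeles, and others like Atlanta")

question_matches = pattern.findall(question_text)
cities_matches = pattern.findall(cities_text)

print(question_matches)
print(cities_matches)
```

Because each word must be [A-Z][a-z]+, names containing digits, hyphens or inner capitals (for example McDonald) will not match as a single word.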
Upvotes: 31