Reputation: 12141
I'm trying to match runs of one or more consecutive capitalized words, but my pattern doesn't seem to work.
import re

def extract(string):
    return re.findall(r'([A-Z][a-z]*(?=\s[A-Z])(?:\s+[A-Z][a-z]*)*)', string)
Here's my test case
def test_extract_capitalize_words(self):
    keywords = extract('This is New York and this is London')
    self.assertEqual(['New York', 'London'], keywords)
It only captures New York and not London.
Upvotes: 1
Views: 1090
Reputation: 138017
Here's a succinct option:
\b(?:[A-Z][a-z]*\b\s*)+
If you use the regex module instead of re, consider using \p{Lu} and \p{Ll} for Unicode uppercase and lowercase letters instead of [A-Z] and [a-z].
The \b at the start and in the middle are word boundaries, and are there to avoid matching words like McBain or GOP. If you want to match these, remove the second \b.
[a-z]* is used to allow single-letter words, like A or I. Use + if you don't want them.
The pattern also matches trailing whitespace after the last word. Append (?<!\s) to explicitly remove that space. A feature like \> (word end) would have been more elegant, but Python, like most flavors, doesn't support it.
Working example: https://regex101.com/r/sT1rS4/1
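As a rough illustration of the trailing-space point, here is a minimal sketch using the standard re module on the question's input. The names RUN, RUN_TRIMMED, text, with_space and trimmed are made up for this example and are not from the answer.

```python
import re

# Pattern from the answer: runs of capitalized words; the match may end on whitespace.
RUN = re.compile(r'\b(?:[A-Z][a-z]*\b\s*)+')

# Same pattern with a negative lookbehind appended, so a match cannot end on whitespace.
RUN_TRIMMED = re.compile(r'\b(?:[A-Z][a-z]*\b\s*)+(?<!\s)')

text = 'This is New York and this is London'

with_space = RUN.findall(text)       # ['This ', 'New York ', 'London']
trimmed = RUN_TRIMMED.findall(text)  # ['This', 'New York', 'London']

print(with_space)
print(trimmed)
```

Note that both variants also return the sentence-initial This, since it is a capitalized word in its own right. If that is unwanted, it would need to be filtered out separately.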
Upvotes: 1
Reputation: 174706
This matches either a run of consecutive capitalized words or a single capitalized word at the end of the line.
>>> import re
>>> s = 'This is New York and this is London'
>>> re.findall(r'\b[A-Z][a-z]*\b(?:(?:\s+[A-Z][a-z]*\b)+|$)', s)
['New York', 'London']
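To see why the pattern in the question misses London, the two patterns can be compared side by side. This is only a sketch: ORIGINAL, FIXED and the result variables are illustrative names, and extract is adapted from the question to take the pattern as a hypothetical second argument.

```python
import re

# Pattern from the question: the lookahead (?=\s[A-Z]) demands that another
# capitalized word follows, so a lone capitalized word can never match.
ORIGINAL = r'([A-Z][a-z]*(?=\s[A-Z])(?:\s+[A-Z][a-z]*)*)'

# Pattern from this answer: either more capitalized words follow, or the
# single capitalized word sits at the end of the string.
FIXED = r'\b[A-Z][a-z]*\b(?:(?:\s+[A-Z][a-z]*\b)+|$)'

def extract(string, pattern=FIXED):
    """Return all matches of the given pattern in the string."""
    return re.findall(pattern, string)

s = 'This is New York and this is London'
original_result = extract(s, ORIGINAL)    # ['New York']
fixed_result = extract(s)                 # ['New York', 'London']

# A lone capitalized word that is not at the end is still skipped.
lone_word_mid = extract('London is big')  # []

print(original_result, fixed_result, lone_word_mid)
```

The last call shows the trade-off of the `$` alternative: a single capitalized word is picked up only when it ends the input, which also explains why the sentence-initial This is not returned.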
Upvotes: 0