toy

Reputation: 12141

Regex to match runs of one or more consecutive capitalized words doesn't work

I'm trying to match runs of one or more consecutive capitalized words, but my regex doesn't work.

import re

def extract(string):
    # Raw string so that \s reaches the regex engine unchanged
    return re.findall(r'([A-Z][a-z]*(?=\s[A-Z])(?:\s+[A-Z][a-z]*)*)', string)

Here's my test case

def test_extract_capitalize_words(self):
    keywords = extract('This is New York and this is London')
    self.assertEqual(['New York', 'London'], keywords)

It only captures New York, not London.

Upvotes: 1

Views: 1090

Answers (2)

Kobi

Reputation: 138017

Here's a succinct option:

\b(?:[A-Z][a-z]*\b\s*)+
  • If you are using the regex module instead of re, consider using \p{Lu} and \p{Ll} for Unicode uppercase and lowercase letters instead of [A-Z] and [a-z].
  • The \b at the start and in the middle are word boundaries, and are there to avoid matching words like McBain or GOP. If you want to match these, remove the second \b.
  • [a-z]* is used for allowing single-letter words, like A or I. Use + if you don't want them.
  • The pattern captures an additional space at the end of the match. You can add (?<!\s) at the end of the pattern to explicitly remove that space. A feature like \> (word end) would have been more elegant, but Python, like most flavors, doesn't support it.
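
To make the trailing-space point concrete, here is a minimal sketch using the standard re module and the sentence from the question. The names PATTERN, PATTERN_TRIMMED, with_space and trimmed are only illustrative, and putting (?<!\s) at the very end of the pattern is my reading of the note above.

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(r'\b(?:[A-Z][a-z]*\b\s*)+')
# Same pattern with a negative lookbehind appended, so a match
# cannot end on a whitespace character.
PATTERN_TRIMMED = re.compile(r'\b(?:[A-Z][a-z]*\b\s*)+(?<!\s)')

text = 'This is New York and this is London'

with_space = PATTERN.findall(text)
trimmed = PATTERN_TRIMMED.findall(text)

print(with_space)  # ['This ', 'New York ', 'London']
print(trimmed)     # ['This', 'New York', 'London']
```

Note that a lone capitalized word such as This is matched too, so the result differs from the list expected by the test in the question.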

Working example: https://regex101.com/r/sT1rS4/1

Upvotes: 1

Avinash Raj

Reputation: 174706

This matches either a run of consecutive capitalized words or a single capitalized word followed by the end of the line.

>>> import re
>>> s = 'This is New York and this is London'
>>> re.findall(r'\b[A-Z][a-z]*\b(?:(?:\s+[A-Z][a-z]*\b)+|$)', s)
['New York', 'London']
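
As a rough illustration of what that means in practice, the sketch below wraps the same pattern in the extract function from the question and runs it on a second sentence that I made up for contrast. A single capitalized word is only returned when it sits at the end of the string.

```python
import re

# Same pattern as above, compiled once.
PATTERN = re.compile(r'\b[A-Z][a-z]*\b(?:(?:\s+[A-Z][a-z]*\b)+|$)')

def extract(string):
    """Return runs of capitalized words, plus a lone capitalized word at the end."""
    return PATTERN.findall(string)

# Sentence from the question: London is the last word, so the $ branch matches it.
at_end = extract('This is New York and this is London')
# Hypothetical sentence: London is mid-sentence and stands alone, so it is skipped.
mid_sentence = extract('I saw London and New York')

print(at_end)        # ['New York', 'London']
print(mid_sentence)  # ['New York']
```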

Upvotes: 0
