Extract all Capital Words from List (Python3)

Question

I am trying to extract all the movie titles written in capital letters from a page that I have scraped, and I am using a regex to do so:

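import re

import requests
from bs4 import BeautifulSoup

wikis = ["http://www.boxofficemojo.com/daily/chart/"]
for wiki in wikis:
    website = requests.get(wiki)
    soup = BeautifulSoup(website.content, "lxml")
    # Join the text of every top-level element in <body>, skipping <script> tags
    text = ''.join([element.text for element in soup.body.find_all(lambda tag: tag.name != 'script', recursive=False)])
    # Keep only letters, spaces and newlines
    new = re.sub(r'[^a-zA-Z \n]', '', text)
    caps = re.findall(r'([A-Z]+(?=\s[A-Z]+)(?:\s[A-Z]+)+)', new)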

However, my output has an extra capital letter appended to the end of the movie titles:

'BEASTS OF NO NATIONN'
'EVEREST U'
'THE MARTIANF'

I'm not sure why, but I know it has something to do with my regex:

caps = re.findall(r'([A-Z]+(?=\s[A-Z]+)(?:\s[A-Z]+)+)', new)

How can I fix this?

R Nar · Accepted Answer

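Use this instead:

caps = re.findall(r'([A-Z]+(?:(?!\s?[A-Z][a-z])\s?[A-Z])+)', new)

The negative lookahead (?!\s?[A-Z][a-z]) makes sure that the next capital letter is not just the start of a capitalized word, that is, a capital followed by a lowercase letter. I can't check this against the live page, so I don't know for sure if it will work.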

EDIT:

I apologize, the previous pattern made no sense once I actually thought about it. It has been changed to one that should work.
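
For anyone who wants to try the two patterns without scraping the page, here is a minimal sketch. The sample string is made up: it assumes the scraped table cells end up glued together (title directly followed by a capitalized studio name), which would explain the output in the question. The names sample, original, fixed, before and after are only for illustration.

```python
import re

# Hypothetical sample text (an assumption, not real scraped data):
# all-caps titles followed by capitalized words, some without a separating space.
sample = "BEASTS OF NO NATIONNetflix EVEREST Universal THE MARTIANFox"

# Pattern from the question: each [A-Z]+ also swallows the first capital
# of a following capitalized word such as "Netflix" or "Fox".
original = r'([A-Z]+(?=\s[A-Z]+)(?:\s[A-Z]+)+)'

# Pattern from the answer: consume one capital at a time, and stop as soon as
# the next capital is followed by a lowercase letter.
fixed = r'([A-Z]+(?:(?!\s?[A-Z][a-z])\s?[A-Z])+)'

before = re.findall(original, sample)
after = re.findall(fixed, sample)

print(before)  # ['BEASTS OF NO NATIONN', 'EVEREST U', 'THE MARTIANF']
print(after)   # ['BEASTS OF NO NATION', 'EVEREST', 'THE MARTIAN']
```

Note that the second pattern matches any run of at least two capital letters, so acronyms elsewhere in the text would also be picked up, and a title consisting of a single capital letter would be missed.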
