user3682157

Reputation: 1695

Extract all Capital Words from List (Python3)

I am trying to extract all movie titles written in capital letters from a page that I have scraped, and I am using a regex to do so:

import re
import requests
from bs4 import BeautifulSoup

wikis = ["http://www.boxofficemojo.com/daily/chart/"]
for wiki in wikis:
    website = requests.get(wiki)
    soup = BeautifulSoup(website.content, "lxml")
    text = ''.join([element.text for element in soup.body.find_all(lambda tag: tag != 'script', recursive=False)])
    new =  re.sub(r'[^a-zA-Z \n]','',text)
    caps = re.findall('([A-Z]+(?=\s[A-Z]+)(?:\s[A-Z]+)+)', new)

However, my output has an extra capital letter appended to the end of each movie title:

'BEASTS OF NO NATIONN'
'EVEREST U'
'THE MARTIANF'

Not sure why but I know it has something to do with my regex code:

caps = re.findall('([A-Z]+(?=\s[A-Z]+)(?:\s[A-Z]+)+)', new)

How can I fix this?

Upvotes: 0

Views: 1953

Answers (2)

Mike Herring

Reputation: 21

The problem is that soup.body.find_all(lambda tag: tag != 'script', recursive=False) only returns 3 elements. The third appears to be all of the text in the body with all tags stripped out. So your movie title is right up against your studio name, like this: THE MARTIANFox. So grabbing the caps from that would give you THE MARTIANF.
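To see the effect in isolation, here is a minimal sketch that runs the question's pattern on already-fused strings. "THE MARTIANFox" is the example above; "EVEREST Universal" is an assumed input chosen to reproduce the 'EVEREST U' output, not text taken from the actual page.

```python
import re

# The pattern from the question, written as a raw string.
pattern = r'([A-Z]+(?=\s[A-Z]+)(?:\s[A-Z]+)+)'

# Title fused with the studio name (first sample is from this answer,
# the second is an assumption for illustration).
samples = ["THE MARTIANFox", "EVEREST Universal"]

# [A-Z]+ keeps consuming capitals, so it swallows the first letter
# of the capitalized studio name as well.
results = [re.findall(pattern, s) for s in samples]
print(results)  # [['THE MARTIANF'], ['EVEREST U']]
```

The regex itself is doing what it was written to do; the stray letter comes from the input text having lost the boundary between title and studio.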

Also, if you look only for capital letters, you will miss titles like MISSION: IMPOSSIBLE - ROGUE NATION because of the non-alpha characters.

How about this instead?

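import re
import requests

wikis = ["http://www.boxofficemojo.com/daily/chart/"]
for wiki in wikis:
    website = requests.get(wiki)
    # Use .text (str) rather than .content (bytes): a str pattern cannot search bytes in Python 3.
    caps = re.findall("<a href=\"/movies[^>]*>([^<a-z]*)</a>", website.text)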

Each movie is inside a link that points to /movies, so that's an easy way to find them. <a href=\"/movies[^>]*> will match the opening anchor tag, ([^<a-z]*) will match a string without lowercase characters inside the anchor tag (the movie title), and then </a> closes it out.
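As a quick offline check, here is a sketch that applies the same pattern to a small hand-written HTML fragment. The markup and the href values are hypothetical; they only imitate the "link that points to /movies" structure described above and are not copied from the real page.

```python
import re

# Hypothetical fragment: two movie links and one studio link.
sample_html = (
    '<td><a href="/movies/?id=themartian.htm">THE MARTIAN</a></td>'
    '<td><a href="/studio/chart/?studio=fox.htm">Fox</a></td>'
    '<td><a href="/movies/?id=mi5.htm">MISSION: IMPOSSIBLE - ROGUE NATION</a></td>'
)

# Only anchors whose href starts with /movies are matched, and the
# captured text may contain anything except '<' and lowercase letters.
titles = re.findall(r'<a href="/movies[^>]*>([^<a-z]*)</a>', sample_html)
print(titles)  # ['THE MARTIAN', 'MISSION: IMPOSSIBLE - ROGUE NATION']
```

Because the character class excludes only lowercase letters and '<', punctuation such as ':' and '-' survives, and the studio link is skipped because its href does not start with /movies.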

Upvotes: 1

R Nar

Reputation: 5515

Use this instead:

caps = re.findall(r'([A-Z]+(?:(?!\s?[A-Z][a-z])\s?[A-Z])+)', new)

The negative lookahead makes sure that the next word is not just a capitalized word. I can't check this, so I don't know for sure whether it will work.

EDIT:

I apologize, the previous version made no sense once I actually thought about it. It has been changed to one that should work.
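For anyone who wants to try it without scraping, here is a small sketch that runs the pattern on fused strings like the ones in the question. The studio suffixes ("Fox", "Universal", "Netflix") are assumptions chosen to reproduce the reported output, not values confirmed from the page.

```python
import re

# Pattern from this answer, written as a raw string.
pattern = r'([A-Z]+(?:(?!\s?[A-Z][a-z])\s?[A-Z])+)'

# Assumed inputs: an all-caps title followed by a capitalized studio name.
samples = ["THE MARTIANFox", "EVEREST Universal", "BEASTS OF NO NATIONNetflix"]

# The lookahead refuses any capital that is followed by a lowercase
# letter, so the match stops before the studio name begins.
results = {s: re.findall(pattern, s) for s in samples}
for source, found in results.items():
    print(source, '->', found)
```

On these inputs the trailing studio initial is no longer captured. The pattern still works only on letters and whitespace, so titles containing punctuation depend on what the earlier re.sub step left in the text.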

Upvotes: 2
