Reputation: 1695
I am trying to extract all of the all-caps movie titles from a page that I have scraped, and I am trying to use a regex to do so:
import re

import requests
from bs4 import BeautifulSoup

wikis = ["http://www.boxofficemojo.com/daily/chart/"]
for wiki in wikis:
    website = requests.get(wiki)
    soup = BeautifulSoup(website.content, "lxml")
    text = ''.join([element.text for element in soup.body.find_all(lambda tag: tag != 'script', recursive=False)])
    # Keep only letters, spaces and newlines, then look for runs of all-caps words
    new = re.sub(r'[^a-zA-Z \n]','',text)
    caps = re.findall(r'([A-Z]+(?=\s[A-Z]+)(?:\s[A-Z]+)+)', new)
However, my output has an extra capital letter appended to the end of the movie titles:
'BEASTS OF NO NATIONN'
'EVEREST U'
'THE MARTIANF'
I'm not sure why, but I know it has something to do with my regex:
caps = re.findall(r'([A-Z]+(?=\s[A-Z]+)(?:\s[A-Z]+)+)', new)
How can I fix this?
Upvotes: 0
Views: 1953
Reputation: 21
The problem is that soup.body.find_all(lambda tag: tag != 'script', recursive=False) only returns 3 elements. The third appears to be all of the text in the body with all tags stripped out. So each movie title sits right up against its studio name, like this: THE MARTIANFox. Grabbing the caps from that gives you THE MARTIANF.
Also, just looking for caps you will miss things like MISSION: IMPOSSIBLE - ROGUE NATION because of the non-alpha characters.
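To make both effects concrete, here is a minimal sketch that runs the question's cleanup and regex on two standalone strings instead of the live page. The helper name extract_caps is hypothetical, and the inputs are only the examples mentioned above, not real scraped output.

```python
import re

# Regex from the question: runs of all-caps words separated by whitespace.
QUESTION_PATTERN = r'([A-Z]+(?=\s[A-Z]+)(?:\s[A-Z]+)+)'


def extract_caps(text):
    """Apply the question's cleanup step and regex to a piece of text."""
    cleaned = re.sub(r'[^a-zA-Z \n]', '', text)
    return re.findall(QUESTION_PATTERN, cleaned)


# Title fused with the studio name: the capital F of "Fox" is swallowed.
fused = extract_caps("THE MARTIANFox")
print(fused)  # ['THE MARTIANF']

# Title with punctuation: stripping ':' and '-' leaves a double space,
# so the title is split into two separate matches.
split = extract_caps("MISSION: IMPOSSIBLE - ROGUE NATION")
print(split)  # ['MISSION IMPOSSIBLE', 'ROGUE NATION']
```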
How about this instead?
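import re

import requests

wikis = ["http://www.boxofficemojo.com/daily/chart/"]
for wiki in wikis:
    website = requests.get(wiki)
    # website.text is the decoded page; on Python 3, website.content is bytes
    # and cannot be searched with a str pattern
    caps = re.findall("<a href=\"/movies[^>]*>([^<a-z]*)</a>", website.text)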
Each movie is inside a link that points to /movies, so that's an easy way to find them. <a href="/movies[^>]*> will match the opening anchor tag, ([^<a-z]*) will match a string without lowercase characters inside the anchor tag (the movie title), and then </a> closes it out.
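As a rough illustration, the same pattern can be tried on a small inline HTML fragment. The markup in sample_html is made up for this sketch (the real page's attributes and link targets may differ); only the /movies prefix is taken from the explanation above.

```python
import re

# Hypothetical HTML fragment; the real page's markup may differ.
sample_html = (
    '<td><a href="/movies/?id=martian.htm">THE MARTIAN</a></td>'
    '<td><a href="/studio/chart/?studio=fox.htm">Fox</a></td>'
    '<td><a href="/movies/?id=mi5.htm">MISSION: IMPOSSIBLE - ROGUE NATION</a></td>'
)

# Anchor tags pointing at /movies whose link text has no lowercase letters.
TITLE_PATTERN = r'<a href="/movies[^>]*>([^<a-z]*)</a>'

titles = re.findall(TITLE_PATTERN, sample_html)
print(titles)  # ['THE MARTIAN', 'MISSION: IMPOSSIBLE - ROGUE NATION']
```

The studio link is skipped because its href does not start with /movies, and the punctuation in the second title survives because nothing is stripped before matching.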
Upvotes: 1
Reputation: 5515
Use this instead:
caps = re.findall(r'([A-Z]+(?:(?!\s?[A-Z][a-z])\s?[A-Z])+)', new)
The negative lookahead makes sure that the next word is not just a capitalized word. I can't test this against the page, so I don't know for sure that it will work.
EDIT:
I apologize, the last one made no sense once I actually thought about it. It has been changed to one that should work.
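For what it's worth, a quick check on standalone strings suggests the pattern behaves as intended. This is only a sketch: "THE MARTIANFox" comes from the other answer, while "EVEREST Uni" is an assumed input chosen to reproduce the 'EVEREST U' output from the question, not confirmed page text.

```python
import re

# Pattern from this answer: stop before a capital that starts a
# capitalized word (an uppercase letter followed by a lowercase one).
LOOKAHEAD_PATTERN = r'([A-Z]+(?:(?!\s?[A-Z][a-z])\s?[A-Z])+)'

# Title fused with the studio name: the F of "Fox" is no longer captured.
martian = re.findall(LOOKAHEAD_PATTERN, "THE MARTIANFox")
print(martian)  # ['THE MARTIAN']

# Assumed input: a one-word title followed by a capitalized word.
everest = re.findall(LOOKAHEAD_PATTERN, "EVEREST Uni")
print(everest)  # ['EVEREST']
```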
Upvotes: 2