Reputation: 119
Hi, I'd like to get movie titles from this website:
import requests
from bs4 import BeautifulSoup

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("#page_filling_chart > table > tbody > tr > td > b > a")
for i in range(len(movie_list)):
    print(movie_list[i].text)
I got a 200 response and have no problem crawling other information, but the problem is with the variable movie_list.
When I print(movie_list), it returns just an empty list, which means I'm using the wrong tags.
Upvotes: 2
Views: 720
Reputation: 312500
If you replace:
movie_list = html.select("#page_filling_chart > table > tbody > tr > td > b > a")
With:
movie_list = html.select("#page_filling_chart table tr > td > b > a")
You get what I think you're looking for. The primary change here is replacing child selectors (parent > child) with descendant selectors (ancestor descendant), which are a lot more forgiving with respect to what the intervening content looks like.
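Here is a minimal sketch of the difference, using made-up markup rather than the actual page source: when a tbody sits between the table and its rows, the child-selector chain breaks, while the descendant selector still matches.

from bs4 import BeautifulSoup

# Hypothetical snippet: an explicit <tbody> separates <table> from <tr>.
snippet = "<div id='chart'><table><tbody><tr><td><b><a>Movie</a></b></td></tr></tbody></table></div>"
doc = BeautifulSoup(snippet, "html.parser")

print(doc.select("#chart > table > tr > td > b > a"))  # [] -- <tbody> breaks the child chain
print(doc.select("#chart table tr > td > b > a"))      # [<a>Movie</a>]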
Update: this is interesting. Your choice of BeautifulSoup parser seems to lead to different behavior.
Compare:
>>> html = BeautifulSoup(raw.text, 'html.parser')
>>> html.select('#page_filling_chart > table')
[]
With:
>>> html = BeautifulSoup(raw.text, 'lxml')
>>> html.select('#page_filling_chart > table')
[<table>
<tr><th>Rank</th><th>Movie</th><th>Release<br/>Date</th><th>Distributor</th><th>Genre</th><th>2019 Gross</th><th>Tickets Sold</th></tr>
<tr>
[...]
In fact, using the lxml parser you can almost use your original selector. This works:

html.select("#page_filling_chart > table > tr > td > b > a")

After parsing, a table has no tbody.
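You can check that quickly with a minimal sketch (assuming lxml is installed, and using a toy table rather than the real page): lxml does not insert a tbody that is missing from the source, so tr stays a direct child of table.

from bs4 import BeautifulSoup

# Toy table with no <tbody> in the source, parsed with lxml.
doc = BeautifulSoup("<table><tr><td>x</td></tr></table>", "lxml")
print(doc.select("table > tr"))  # [<tr><td>x</td></tr>]
print(doc.find("tbody"))         # None -- no tbody is synthesized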
After experimenting for a bit, you would have to rewrite your original query like this to get it to work with html.parser:

html.select("#page_filling_chart2 > p > p > p > p > p > table > tr > td > b > a")

It looks like html.parser doesn't synthesize closing </p> elements when they are missing from the source, so all the unclosed <p> tags result in a weird parsed document structure.
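Here's a rough illustration of that difference with a made-up snippet (not the real page markup): html.parser leaves the unclosed <p> open and nests the table inside it, while lxml implicitly closes the paragraph first.

from bs4 import BeautifulSoup

# Made-up snippet: an unclosed <p> followed by a <table>.
snippet = "<div id='c'><p>intro<table><tr><td>x</td></tr></table></div>"

# html.parser: the <table> ends up inside the still-open <p>.
print(BeautifulSoup(snippet, "html.parser").select("#c > table"))      # []
print(BeautifulSoup(snippet, "html.parser").select("#c > p > table"))  # [<table>...</table>]

# lxml: the <p> is closed before the table, so the <table> is a direct child of the <div>.
print(BeautifulSoup(snippet, "lxml").select("#c > table"))             # [<table>...</table>]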
Upvotes: 5
Reputation: 494
Here is the solution for this question:

from bs4 import BeautifulSoup
import requests

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")

movie_table_rows = html.find_all("table")[0].find_all('tr')

movie_list = []
for tr in movie_table_rows[1:]:  # skip the header row
    tds = tr.find_all('td')
    movie_list.append(tds[1].text)  # the movie name is in the second column

print(movie_list)
Basically, the way you are trying to extract the text is incorrect, since the selector path is different for each movie-name anchor tag; iterating over the rows and indexing into each row's cells avoids that problem.
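Continuing from the snippet above, a small variation on the same row-by-row idea (assuming each title cell wraps the name in a link, as the selectors in the question suggest) collects the title together with its href:

titles_with_links = []
for tr in movie_table_rows[1:]:
    tds = tr.find_all('td')
    anchor = tds[1].find('a')  # the anchor inside the movie-name cell, if present
    if anchor is not None:
        titles_with_links.append((anchor.text, anchor.get('href')))

print(titles_with_links[:5])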
Upvotes: 2
Reputation: 123
This should work:
import requests
from bs4 import BeautifulSoup

url = 'https://www.the-numbers.com/market/2019/top-grossing-movies'
raw = requests.get(url)
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("table > tr > td > b > a")
for i in range(len(movie_list)):
    print(movie_list[i].text)
Upvotes: 2