Reputation: 3
I'm following the book "Practical Web Scraping for Data Science: Best Practices and Examples with Python" by Seppe vanden Broucke and Bart Baesens.
The following code is supposed to fetch a list of Game of Thrones episodes from Wikipedia:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php' + \
      '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
        for row in table.find_all('tr')[1:]:
            values = []
            for col in row.find_all(['th','td']):
                values.append(col.text)
                if values:
                    episode_dict = {headers[i]: values[i] for i in
                                    range(len(values))}
                    episodes.append(episode_dict)
                    for episode in episodes:
                        print(episode)
But when I run the code, the following error appears:
{'No.overall': '1'}
IndexError Traceback (most recent call last)
<ipython-input-8-d2e64c7e0540> in <module>
20 if values:
21 episode_dict = {headers[i]: values[i] for i in
---> 22 range(len(values))}
23 episodes.append(episode_dict)
24 for episode in episodes:
<ipython-input-8-d2e64c7e0540> in <dictcomp>(.0)
19 values.append(col.text)
20 if values:
---> 21 episode_dict = {headers[i]: values[i] for i in
22 range(len(values))}
23 episodes.append(episode_dict)
IndexError: list index out of range
Could anyone tell me why this is happening?
Upvotes: 0
Views: 170
Reputation: 831
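The problem is not the code itself but its indentation. The third for loop (over the rows) should sit at the same level as the second for loop (over the header cells), not inside it. When it is nested, the rows are processed while headers is still being filled, so headers[i] runs past the end of the list. This is how it's shown in the book: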
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php' + \
      '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikitable plainrowheaders wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    # Start by fetching the header cells from the first row to determine
    # the field names
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
    # Then go through all the rows except the first one
    for row in table.find_all('tr')[1:]:
        values = []
        # And get the column cells, the first one being inside a th-tag
        for col in row.find_all(['th','td']):
            values.append(col.text)
        if values:
            episode_dict = {headers[i]: values[i] for i in
                            range(len(values))}
            episodes.append(episode_dict)
# Show the results
for episode in episodes:
    print(episode)
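To see why the nesting matters without fetching the page, here is a minimal sketch that swaps the parsed table for plain Python lists. The sample header and cell values, and the function names rows_nested and rows_sequential, are made up for illustration and are not taken from the book or from Wikipedia. With the row loop inside the header loop, headers holds a single entry when the first row is indexed, which is presumably what raises the IndexError in the question.

```python
# Hypothetical stand-ins for the cell texts BeautifulSoup would return.
header_cells = ["No. overall", "No. in season", "Title"]
table_rows = [["1", "1", "Winter Is Coming"], ["2", "2", "The Kingsroad"]]


def rows_nested(header_cells, table_rows):
    """Row loop nested inside the header loop (the broken layout)."""
    headers, episodes = [], []
    for header in header_cells:
        headers.append(header)
        # On the first pass headers has only one entry, but each row has three.
        for values in table_rows:
            episodes.append({headers[i]: values[i] for i in range(len(values))})
    return episodes


def rows_sequential(header_cells, table_rows):
    """Header loop finishes first, then the row loop runs."""
    headers, episodes = [], []
    for header in header_cells:
        headers.append(header)
    for values in table_rows:
        episodes.append({headers[i]: values[i] for i in range(len(values))})
    return episodes


try:
    rows_nested(header_cells, table_rows)
except IndexError as exc:
    print("nested layout fails:", exc)

print(rows_sequential(header_cells, table_rows)[0])
```

The only difference between the two functions is the indentation of the row loop, which mirrors the difference between the question's code and the listing above.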
Upvotes: 1
Reputation: 6972
Your trace is
{'No.overall': '1'}
Traceback (most recent call last):
File "/Users/karl/code/deleteme/foo.py", line 20, in <module>
episode_dict = {headers[i]: values[i] for i in
File "/Users/karl/code/deleteme/foo.py", line 20, in <dictcomp>
episode_dict = {headers[i]: values[i] for i in
IndexError: list index out of range
The code is probably indented too deeply, and the choice of variable names makes it a bit hard to read. It would also be useful to know exactly what you are trying to extract. The list of episodes? The table structure may have changed since the book was written.
If so, each relevant episode title cell has this shape:
<td class="summary" style="text-align:left">"<a href="/wiki/Stormborn" title="Stormborn">Stormborn</a>"</td>
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
    for row in table.find_all('tr')[1:]:
        values = []
        # Only the title cells carry the "summary" class
        for col in row.find_all('td', class_='summary'):
            print(col.text)
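If the full rows are wanted and some data rows turn out to have a different number of cells than the header row, one defensive option is to pair headers and cells with zip, which stops at the shorter sequence instead of raising IndexError. This is only a sketch using plain lists in place of parsed cells. The helper name row_to_dict and the sample values are hypothetical.

```python
def row_to_dict(headers, values):
    """Pair each header with the cell text in the same position.

    zip() stops at the shorter of the two sequences, so a row with
    more or fewer cells than the header row does not raise IndexError.
    """
    return dict(zip(headers, (value.strip() for value in values)))


# Made-up sample data, not taken from the Wikipedia page.
headers = ["No. overall", "Title"]
print(row_to_dict(headers, ["1", ' "Winter Is Coming" ']))
print(row_to_dict(headers, ["2"]))                       # short row
print(row_to_dict(headers, ["3", '"Lord Snow"', "extra"]))  # extra cell is dropped
```

The trade-off is that mismatched rows are truncated silently, so it may be worth comparing len(headers) and len(values) and reporting any row where they differ.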
Upvotes: 0