Ozharu-Ad

Reputation: 3

Beautiful Soup, fetching table data from Wikipedia

I'm following the book "Practical Web Scraping for Data Science: Best Practices and Examples with Python" by Seppe vanden Broucke and Bart Baesens.

The following code is supposed to fetch a list of Game of Thrones episodes from Wikipedia:

import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
        for row in table.find_all('tr')[1:]:
            values = []
            for col in row.find_all(['th','td']):
                values.append(col.text)
                if values:
                    episode_dict = {headers[i]: values[i] for i in
                                    range(len(values))}
                    episodes.append(episode_dict)
                    for episode in episodes:
                        print(episode)

But when I run the code, the following error appears:

{'No.overall': '1'}

IndexError Traceback (most recent call last)

<ipython-input-8-d2e64c7e0540> in <module>
     20                 if values:
     21                     episode_dict = {headers[i]: values[i] for i in
---> 22                                     range(len(values))}
     23                     episodes.append(episode_dict)
     24                     for episode in episodes:

<ipython-input-8-d2e64c7e0540> in <dictcomp>(.0)
     19                 values.append(col.text)
     20                 if values:
---> 21                     episode_dict = {headers[i]: values[i] for i in
     22                                     range(len(values))}
     23                     episodes.append(episode_dict)

IndexError: list index out of range

Could anyone tell me why this is happening?

Upvotes: 0

Views: 170

Answers (2)

Ananth

Reputation: 831

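The problem is not the code itself but its indentation. The third for loop (over the rows) should sit at the same level as the second (over the header cells), not inside it. As posted, the row loop starts after only the first header has been appended, so headers[i] fails as soon as a row yields a second cell. This is how the code appears in the book: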

import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikitable plainrowheaders wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    # Start by fetching the header cells from the first row to determine
    # the field names
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
    # Then go through all the rows except the first one
    for row in table.find_all('tr')[1:]:
        values = []
        # And get the column cells, the first one being inside a th-tag
        for col in row.find_all(['th','td']):
            values.append(col.text)
        if values:
            episode_dict = {headers[i]: values[i] for i in
                            range(len(values))}
            episodes.append(episode_dict)
# Show the results
for episode in episodes:
    print(episode)
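
To see why the nesting produces exactly the output in the question, here is a minimal sketch that needs no network access. The two lists are hypothetical stand-ins for the parsed header and first data row (only 'No.overall' is taken from the question's output), and build_nested / build_flat are illustrative names, not code from the book.

```python
# Hypothetical stand-ins for the cells BeautifulSoup would return.
header_cells = ["No.overall", "No. inseason", "Title"]
first_row_cells = ["1", "1", '"Winter Is Coming"']


def build_nested(header_cells, row_cells):
    """Mimic the question's indentation: the row is processed while
    headers is still being filled, and the dict is built per cell."""
    headers = []
    for header in header_cells:
        headers.append(header)
        values = []
        for cell in row_cells:
            values.append(cell)
            # After the first cell this works (one header, one value).
            # After the second cell, headers[1] does not exist yet.
            episode_dict = {headers[i]: values[i] for i in range(len(values))}
            print(episode_dict)


def build_flat(header_cells, row_cells):
    """Mimic the corrected layout: collect all headers first,
    then all values of a row, then build the dict once."""
    headers = list(header_cells)
    values = list(row_cells)
    return {headers[i]: values[i] for i in range(len(values))}


try:
    build_nested(header_cells, first_row_cells)
except IndexError as exc:
    print("IndexError:", exc)

print(build_flat(header_cells, first_row_cells))
```

The nested version prints one single-key dict and then raises, which matches the traceback in the question; the flat version returns the full row.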

Upvotes: 1

karlcow

Reputation: 6972

Your trace is

{'No.overall': '1'}
Traceback (most recent call last):
  File "/Users/karl/code/deleteme/foo.py", line 20, in <module>
    episode_dict = {headers[i]: values[i] for i in
  File "/Users/karl/code/deleteme/foo.py", line 20, in <dictcomp>
    episode_dict = {headers[i]: values[i] for i in
IndexError: list index out of range

The code is probably indented too deeply, and the choice of variable names makes it a bit hard to read. It would also be useful to know exactly what you are trying to extract. The list of episodes? The table structure may also have changed since the book was written.

If so, each relevant episode title has this shape:

<td class="summary" style="text-align:left">"<a href="/wiki/Stormborn" title="Stormborn">Stormborn</a>"</td>

This prints just those titles:

import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
ep_tables = html_soup.find_all('table', class_='wikiepisodetable')
for table in ep_tables:
    # Skip the header row, then print the title cell of each episode row.
    # The row loop is not nested inside a header loop, so each title
    # is printed once.
    for row in table.find_all('tr')[1:]:
        for col in row.find_all('td', class_='summary'):
            print(col.text)
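
If you do want the full dictionaries and are worried about rows whose cell count differs from the header row, one option is to pair names and values with zip instead of indexing. This is only a sketch of that idea, not the book's approach; row_to_dict and the sample lists are hypothetical.

```python
def row_to_dict(headers, values):
    """Pair header names with cell values.

    zip stops at the shorter of the two lists, so a row with extra
    or missing cells never raises IndexError. Surplus cells are
    silently dropped, which may or may not be what you want.
    """
    return dict(zip(headers, values))


# Hypothetical header names and rows for illustration only.
headers = ["No.overall", "Title"]
print(row_to_dict(headers, ["1", '"Winter Is Coming"', "extra cell"]))
print(row_to_dict(headers, ["2"]))
```

The trade-off is that a mismatch no longer fails loudly, so it can be worth logging rows where len(values) != len(headers).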

Upvotes: 0
