Beautiful Soup, fetching table data from Wikipedia

Question

I'm following the book "Practical Web Scraping for Data Science: Best Practices and Examples with Python" by Seppe vanden Broucke and Bart Baesens.

The following code is supposed to fetch data from Wikipedia, namely a list of Game of Thrones episodes:

import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
        for row in table.find_all('tr')[1:]:
            values = []
            for col in row.find_all(['th','td']):
                values.append(col.text)
                if values:
                    episode_dict = {headers[i]: values[i] for i in
                                    range(len(values))}
                    episodes.append(episode_dict)
                    for episode in episodes:
                        print(episode)

But when I run the code, the following error appears:

{'No.overall': '1'}

IndexError Traceback (most recent call last)

 in 
     20                 if values:
     21                     episode_dict = {headers[i]: values[i] for i in
---> 22                                     range(len(values))}
     23                     episodes.append(episode_dict)
     24                     for episode in episodes:

 in (.0)
     19                 values.append(col.text)
     20                 if values:
---> 21                     episode_dict = {headers[i]: values[i] for i in
     22                                     range(len(values))}
     23                     episodes.append(episode_dict)

IndexError: list index out of range

Could anyone tell me why this is happening?

Ananth · Accepted Answer

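The problem is not with the logic of the code, it's with the indentation. In your version every block is nested one level too deep: the loop over the rows runs inside the loop over the header cells, and the dictionary is built inside the loop over the columns. As a result, the rows are processed while headers still holds only its first entry ('No.overall'). The first cell works, which is why {'No.overall': '1'} is printed, but as soon as values receives a second item, headers[1] does not exist and the IndexError is raised. The loop over the rows should be at the same level as the loop over the header cells, not inside it, and the final printing loop belongs outside the loop over the tables. This is how it's shown in the book: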

import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')
# We'll use a list to store our episode list
episodes = []
ep_tables = html_soup.find_all('table', class_='wikitable plainrowheaders wikiepisodetable')
for table in ep_tables:
    headers = []
    rows = table.find_all('tr')
    # Start by fetching the header cells from the first row to determine
    # the field names
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
    # Then go through all the rows except the first one
    for row in table.find_all('tr')[1:]:
        values = []
        # And get the column cells, the first one being inside a th-tag
        for col in row.find_all(['th','td']):
            values.append(col.text)
        if values:
            episode_dict = {headers[i]: values[i] for i in
                            range(len(values))}
            episodes.append(episode_dict)
# Show the results
for episode in episodes:
    print(episode)
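
To see the difference without hitting Wikipedia, here is a minimal sketch that replaces the parsed table with plain Python lists. The names header_cells, data_rows, nested_version and sequential_version are made up for illustration, and the cell values are stand-ins rather than scraped data. The second function also swaps the index-based comprehension for zip, which is my own substitution and not what the book uses.

```python
# Stand-ins for the parsed table: one header row and two data rows.
# These values are illustrative only, not scraped from Wikipedia.
header_cells = ["No.overall", "No. inseason", "Title"]
data_rows = [["1", "1", "Winter Is Coming"], ["2", "2", "The Kingsroad"]]


def nested_version():
    """Mirror the question's indentation: rows are handled inside the header loop."""
    headers = []
    for header in header_cells:
        headers.append(header)
        # At this point headers holds only the cells seen so far.
        for row in data_rows:
            values = []
            for col in row:
                values.append(col)
                # Runs after every cell, so len(values) soon exceeds len(headers).
                episode = {headers[i]: values[i] for i in range(len(values))}
                print(episode)


def sequential_version():
    """Mirror the book's indentation: collect all headers first, then the rows."""
    headers = list(header_cells)
    episodes = []
    for row in data_rows:
        values = list(row)
        if values:
            # zip stops at the shorter sequence, so a row with more cells
            # than there are headers cannot raise IndexError.
            episodes.append(dict(zip(headers, values)))
    return episodes


try:
    nested_version()
except IndexError as exc:
    print("IndexError:", exc)

print(sequential_version())
```

The nested version prints one single-key dictionary and then fails on the second cell, which matches the output in the question. The sequential version returns one complete dictionary per row.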
