ESPN.com Python web scraping issue

Question

I am trying to pull data for the rosters for all college football teams because I want to run some analysis on team performance based on composition of their roster.

My script works on the first page: it iterates over each team and opens each team's roster link. However, the Beautiful Soup calls I run on a team's roster page keep raising IndexError. Judging by the HTML, my selectors should work, yet when I print the page source from Beautiful Soup I don't see what Chrome's Developer Tools shows. Is this a case of JavaScript being used to serve the content? If so, I thought Selenium got around that?

My code...

import requests
import csv
from bs4 import BeautifulSoup
from selenium import webdriver

teams_driver = webdriver.Firefox()
teams_driver.get("http://www.espn.com/college-football/teams")
teams_html = teams_driver.page_source
teams_soup = BeautifulSoup(teams_html, "html5lib")

i = 0

for link_html in teams_soup.find_all('a'):
    if link_html.text == 'Roster':
        roster_link = 'https://www.espn.com' + link_html['href']

        roster_driver = webdriver.Firefox()
        roster_driver.get(roster_link)
        roster_html = teams_driver.page_source
        roster_soup = BeautifulSoup(roster_html, "html5lib")

        team_name_html = roster_soup.find_all('a', class_='sub-brand-title')[0]
        team_name = team_name_html.find_all('b')[0].text

        for player_html in roster_soup.find_all('tr', class_='oddrow'):
            player_name = player_html.find_all('a')[0].text
            player_pos = player_html.find_all('td')[2].text
            player_height = player_html.find_all('td')[3].text
            player_weight = player_html.find_all('td')[4].text
            player_year = player_html.find_all('td')[5].text
            player_hometown = player_html.find_all('td')[6].text

            print(team_name)
            print('\t', player_name)

        roster_driver.close()

teams_driver.close()

t.m.adam · Accepted Answer

In your for loop you're parsing the HTML of the first page again (roster_html = teams_driver.page_source, where it should be roster_driver.page_source). As a result, find_all('a', class_='sub-brand-title') returns an empty list, and indexing it with [0] raises an IndexError.
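A minimal sketch of that failure mode, using a made-up HTML snippet in place of the teams page (the markup and the html.parser backend are assumptions for illustration, not ESPN's actual page). When the expected element is absent, find_all returns an empty list and [0] raises IndexError, whereas find returns None:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the wrong page: it has no 'sub-brand-title' link.
wrong_page_html = "<html><body><a href='/roster'>Roster</a></body></html>"
soup = BeautifulSoup(wrong_page_html, "html.parser")

# find_all never raises when nothing matches; it returns an empty result list.
matches = soup.find_all("a", class_="sub-brand-title")

try:
    matches[0]  # indexing the empty list is what actually fails
    error_name = None
except IndexError as exc:
    error_name = type(exc).__name__

# find returns None instead of a list, so a missing element can be checked for.
first = soup.find("a", class_="sub-brand-title")
if first is None:
    print("element not found; probably parsing the wrong page")
```

Checking for an empty result before indexing usually makes this kind of mix-up easier to spot than a bare IndexError.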

Also, you don't need to keep all those Firefox instances open; you can quit the driver as soon as you have the HTML.

teams_driver = webdriver.Firefox()
teams_driver.get("http://www.espn.com/college-football/teams")
teams_html = teams_driver.page_source
teams_driver.quit()
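If a page load fails partway through, the browser can still be left open. One possible way to guard against that is a small helper with try/finally. This is only a sketch: fetch_page_source and driver_factory are hypothetical names, and the helper assumes the driver object has get, page_source and quit, as Selenium's WebDriver does.

```python
def fetch_page_source(driver_factory, url):
    """Open a browser, load url, return its HTML, and always quit the browser."""
    driver = driver_factory()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Runs even if get() raises, so no browser window is left behind.
        driver.quit()
```

With Selenium installed, this could be called as fetch_page_source(webdriver.Firefox, "http://www.espn.com/college-football/teams").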

But you don't need Selenium for this task at all; you can get all the data with requests and bs4.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.espn.com/college-football/teams")
teams_soup = BeautifulSoup(r.text, "html5lib")

for link_html in teams_soup.find_all('a'):
    if link_html.text == 'Roster':
        roster_link = 'https://www.espn.com' + link_html['href']
        r = requests.get(roster_link)
        roster_soup = BeautifulSoup(r.text, "html5lib")

        team_name = roster_soup.find('a', class_='sub-brand-title').find('b').text
        for player_html in roster_soup.find_all('tr', class_='oddrow'):
            cells = player_html.find_all('td')  # look up the row's cells once
            player_name = player_html.find_all('a')[0].text
            player_pos = cells[2].text
            player_height = cells[3].text
            player_weight = cells[4].text
            player_year = cells[5].text
            player_hometown = cells[6].text
            print(team_name, player_name, player_pos, player_height, player_weight, player_year, player_hometown)
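The loop above only prints each player. Since the question imports csv and the goal is later analysis, here is a rough sketch of how the collected values might be saved instead. The column names, the write_roster_csv helper and the sample row are all made up for illustration:

```python
import csv
import io

# Hypothetical column names matching the values printed in the loop.
FIELDS = ["team", "name", "position", "height", "weight", "year", "hometown"]


def write_roster_csv(rows, fileobj):
    """Write a header line followed by one CSV line per player tuple."""
    writer = csv.writer(fileobj)
    writer.writerow(FIELDS)
    writer.writerows(rows)


# Made-up sample row standing in for values collected inside the loop.
sample_rows = [
    ("Example State", "Pat Doe", "QB", "6-2", "210", "JR", "Springfield, IL"),
]

# An in-memory buffer keeps the sketch self-contained; a real file works too.
buffer = io.StringIO()
write_roster_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
```

To write to disk, append a tuple to a list in place of the print call, then pass a file opened with open("rosters.csv", "w", newline="") (the file name is an assumption). The csv module takes care of quoting values that contain commas, such as hometowns.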
