Unable to scrape data from tables using BeautifulSoup find_all or Pandas.read_html functions

Question

Background:

I'm writing code to scrape text from the following webpage:

https://www.pro-football-reference.com/boxscores/201809060phi.htm

The page contains several tables, and I'd like to extract specific data from them. As an example, I want to pull the value from the row labeled "Roof" in the table labeled "Game Info".

The Attempt:

I've tried to scrape this information in two ways: with Pandas and with BeautifulSoup. (I also import MechanicalSoup, because I have additional code that opens links in a for loop.)

Pandas Attempt:

import mechanicalsoup
from bs4 import BeautifulSoup
import pandas

root_url="https://www.pro-football-reference.com"

#Opens the main pro-football page with list of 2018 games
browser=mechanicalsoup.StatefulBrowser()
browser.open("https://www.pro-football-reference.com/years/2018/games.htm")
main_page = browser.get_current_page()
browser.close()
data=main_page.find_all("tr")

#Finds the link to box-score information for the first game.  
#Will iterate over all games in a for loop later on.
box_score_tag = data[1].find("td",{"data-stat":"boxscore_word"})
box_score_link = root_url+box_score_tag.a.get("href")

#Opens the box-score page for the first game
browser2=mechanicalsoup.StatefulBrowser()
browser2.open(box_score_link)
boxscorepage=browser2.get_current_page()
browser2.close()

#attempt to scrape all the tables using Pandas
tables = pandas.read_html(box_score_link)
print(len(tables))

This prints 3, i.e. pandas.read_html finds only 3 tables, when the page clearly contains many more.

BeautifulSoup Attempt (replaces the last 3 lines of the code above):

#attempt to scrape the specific table in question using BeautifulSoup
game_info = boxscorepage.find_all("table",{"id":"game_info"})
print(game_info)

This prints an empty list. On this page, find_all works for some tags (divs, spans, etc.) but not for others. In this case, it does not find the table with the id "game_info" as intended.
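One way to narrow this down is to check whether the missing table is present in the raw HTML but wrapped in an HTML comment, in which case a parser would not see it as a tag. This is only an assumption about the cause and is not confirmed for this page. The sketch below uses the standard library's html.parser rather than BeautifulSoup so it has no dependencies; the TableFinder class and the SAMPLE_HTML markup (including the "outdoors" value) are hypothetical and not copied from the real page.

```python
from html.parser import HTMLParser


class TableFinder(HTMLParser):
    """Collects ids of <table> tags in live markup and inside HTML comments."""

    def __init__(self):
        super().__init__()
        self.live_ids = []       # tables a normal parser would see as tags
        self.commented_ids = []  # tables hidden inside <!-- ... -->

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.live_ids.append(dict(attrs).get("id"))

    def handle_comment(self, data):
        # Re-parse the comment body as HTML and record any tables found in it.
        inner = TableFinder()
        inner.feed(data)
        inner.close()
        self.commented_ids.extend(inner.live_ids)


# Hypothetical sample markup (an assumption, not the real page source).
SAMPLE_HTML = """
<div id="all_scoring"><table id="scoring"><tr><td>1</td></tr></table></div>
<div id="all_game_info">
<!--
<table id="game_info"><tr><th>Roof</th><td>outdoors</td></tr></table>
-->
</div>
"""

finder = TableFinder()
finder.feed(SAMPLE_HTML)
finder.close()
print("live tables:", finder.live_ids)
print("commented-out tables:", finder.commented_ids)
```

To run the same check against the real page, the sample string could be replaced with str(boxscorepage) from the code above. If "game_info" shows up only in the commented-out list, that would explain why find_all misses it; if it appears in neither list, the table is probably added by JavaScript after the page loads.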
