crosscheck9

Reputation: 1

Unable to scrape data from tables using BeautifulSoup find_all or pandas.read_html functions

Background:

I'm writing code to scrape text from the following webpage

https://www.pro-football-reference.com/boxscores/201809060phi.htm.

The page contains several tables, and I'd like to extract specific data from them. To illustrate the example here, I'm looking to pull data from the row labeled "Roof" in the Table labeled "Game Info".

The Attempt:

I've attempted to scrape this information in two ways: with BeautifulSoup (I also import MechanicalSoup because I have additional code that opens links in a for loop), and with pandas.

Pandas Attempt:

import mechanicalsoup
from bs4 import BeautifulSoup
import pandas

root_url="https://www.pro-football-reference.com"

#Opens the main pro-football page with list of 2018 games
browser=mechanicalsoup.StatefulBrowser()
browser.open("https://www.pro-football-reference.com/years/2018/games.htm")
main_page = browser.get_current_page()
browser.close()
data=main_page.find_all("tr")

#Finds the link to box-score information for the first game.  
#Will iterate over all games in a for loop later on.
box_score_tag = data[1].find("td",{"data-stat":"boxscore_word"})
box_score_link = root_url+box_score_tag.a.get("href")

#Opens the box-score page for the first game
browser2=mechanicalsoup.StatefulBrowser()
browser2.open(box_score_link)
boxscorepage=browser2.get_current_page()
browser2.close()

#attempt to scrape all the tables using Pandas
tables = pandas.read_html(box_score_link)
print(len(tables))

The pandas attempt prints 3 (i.e., it only finds 3 tables), when the page clearly contains many more.

BeautifulSoup Attempt (replacing the last three lines above):

#attempt to scrape the specific table in question using BeautifulSoup
game_info = boxscorepage.find_all("table",{"id":"game_info"})
print(game_info)

This outputs an empty list. On this page, finding some tags (divs, spans, etc.) works, but others don't; in this case, it fails to find the table with id game_info as intended.
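One way to narrow this down (a minimal sketch using requests directly rather than MechanicalSoup, purely for illustration) is to compare the raw page source against what BeautifulSoup parses:

import requests
from bs4 import BeautifulSoup

url = "https://www.pro-football-reference.com/boxscores/201809060phi.htm"
raw = requests.get(url).text

# Is the id anywhere in the raw source at all?
print('id="game_info"' in raw)                  # True if the markup exists in the source

# Can BeautifulSoup see it as a real tag?
soup = BeautifulSoup(raw, "html.parser")
print(soup.find("table", {"id": "game_info"}))  # None if it is not part of the parsed DOM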

Upvotes: -1

Views: 200

Answers (1)

chitown88

Reputation: 28565

No need to use Selenium. Those tables are inside HTML comments; pull the comments out and you can grab all the table tags from them. That particular table is the second one (at index position 1).

Code:

import requests
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd


url = 'https://www.pro-football-reference.com/boxscores/201809060phi.htm'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

# The hidden tables are wrapped in HTML comments, so collect every comment node
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in each:
        try:
            # Parse any table markup found inside the comment
            tables.append(pd.read_html(each)[0])
        except ValueError:
            # Comment mentions 'table' but contains no parseable table
            continue

print(tables[1].loc[1:])

Output:

print(tables[1].loc[1:])
            0                         1
1    Won Toss         Eagles (deferred)
2        Roof                  outdoors
3     Surface                     grass
4    Duration                      3:19
5  Attendance                     69696
6     Weather    81 degrees, wind 8 mph
7  Vegas Line  Philadelphia Eagles -1.0
8  Over/Under              44.5 (under)
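
Since the original question asked specifically for the "Roof" row of the Game Info table, a minimal follow-up sketch building on the tables list above (it assumes the default integer column labels 0 and 1 shown in the output):

# tables[1] is the Game Info table; look up the value in the "Roof" row.
game_info = tables[1]
roof = game_info.loc[game_info[0] == 'Roof', 1].iloc[0]
print(roof)  # outdoors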

Upvotes: 1
