Reputation: 1
Background:
I'm writing code to scrape text from the following webpage:
https://www.pro-football-reference.com/boxscores/201809060phi.htm
The page contains several tables, and I'd like to extract specific data from them. As an example, I want to pull the value from the row labeled "Roof" in the table labeled "Game Info".
The Attempt:
I've tried to scrape this information in two ways: with BeautifulSoup (I also import MechanicalSoup because I have additional code that opens links in a for loop) and with pandas.
Pandas Attempt:
import mechanicalsoup
from bs4 import BeautifulSoup
import pandas
root_url="https://www.pro-football-reference.com"
#Opens the main pro-football page with list of 2018 games
browser=mechanicalsoup.StatefulBrowser()
browser.open("https://www.pro-football-reference.com/years/2018/games.htm")
main_page = browser.get_current_page()
browser.close()
data=main_page.find_all("tr")
#Finds the link to box-score information for the first game.
#Will iterate over all games in a for loop later on.
box_score_tag = data[1].find("td",{"data-stat":"boxscore_word"})
box_score_link = root_url+box_score_tag.a.get("href")
#Opens the box-score page for the first game
browser2=mechanicalsoup.StatefulBrowser()
browser2.open(box_score_link)
boxscorepage=browser2.get_current_page()
browser2.close()
#attempt to scrape all the tables using Pandas
tables = pandas.read_html(box_score_link)
print(len(tables))
This prints 3 (i.e. pandas.read_html pulls only 3 tables), when the page clearly contains many more.
BeautifulSoup Attempt (replacing the last 3 lines above):
#attempt to scrape the specific table in question using BeautifulSoup
game_info = boxscorepage.find_all("table",{"id":"game_info"})
print(game_info)
This prints an empty list. On this page, finding some tags (divs, spans, etc.) works, but others are not found. In this case, the table with id "game_info" is not found as intended.
Upvotes: -1
Views: 200
Reputation: 28565
No need to use Selenium. Those tables sit inside HTML comments in the page source, so a normal parse never sees them as tags. Pull the comments out, parse each one, and you can grab all the table tags. That particular table is the second one (index position 1).
Code:
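import requests
from io import StringIO
from bs4 import BeautifulSoup, Comment
import pandas as pd
url = 'https://www.pro-football-reference.com/boxscores/201809060phi.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# The tables are wrapped in HTML comments, so collect every comment node
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
tables = []
for each in comments:
    if 'table' in each:
        try:
            # StringIO avoids the literal-HTML deprecation in newer pandas
            tables.append(pd.read_html(StringIO(each))[0])
        except ValueError:
            # read_html raises ValueError when no table can be parsed
            continue
# Game Info is the second table; .loc[1:] drops row 0
print(tables[1].loc[1:])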
Output:
            0                         1
1    Won Toss         Eagles (deferred)
2        Roof                  outdoors
3     Surface                     grass
4    Duration                      3:19
5  Attendance                     69696
6     Weather    81 degrees, wind 8 mph
7  Vegas Line  Philadelphia Eagles -1.0
8  Over/Under              44.5 (under)
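If you would rather look a row up by its label (e.g. "Roof") than rely on the table's position in the list, here is a minimal sketch of the same comment-extraction idea using only BeautifulSoup. The helper names find_commented_table and table_to_dict are hypothetical, and SAMPLE_HTML is a small stand-in I made up to mimic the commented-out markup; the real page's structure may differ, so treat this as an illustration and not a drop-in solution.

```python
from bs4 import BeautifulSoup, Comment

# Stand-in markup (an assumption): a table hidden inside an HTML comment
SAMPLE_HTML = """
<div id="all_game_info">
<!--
<table id="game_info">
<tr><th colspan="2">Game Info</th></tr>
<tr><th>Won Toss</th><td>Eagles (deferred)</td></tr>
<tr><th>Roof</th><td>outdoors</td></tr>
<tr><th>Surface</th><td>grass</td></tr>
</table>
-->
</div>
"""


def find_commented_table(html, table_id):
    """Return the <table> with the given id from inside an HTML comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
        if table_id not in comment:
            continue
        # Re-parse the comment text so its contents become real tags
        table = BeautifulSoup(comment, "html.parser").find("table", id=table_id)
        if table is not None:
            return table
    return None


def table_to_dict(table):
    """Map the first cell of each two-cell row to the second cell."""
    rows = {}
    for tr in table.find_all("tr"):
        cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
        if len(cells) == 2:
            rows[cells[0]] = cells[1]
    return rows


info = table_to_dict(find_commented_table(SAMPLE_HTML, "game_info"))
print(info["Roof"])
```

To run it against the real page, pass response.text from the requests call in place of SAMPLE_HTML. Looking up by label keeps working if the order of the commented tables changes, whereas tables[1] would silently point at a different table.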
Upvotes: 1