Pappy Oh

Reputation: 9

Scraping the second table on a web page

I am trying to scrape the second table (Year-by-Year Team Batting per Game) on this webpage, but I have only been able to scrape the first table (Year-by-Year Team Batting). I have researched a couple of different ways to scrape using BeautifulSoup, but have not been successful in getting the table. The code for the two methods I have tried is below. Any help, thoughts, or ideas would be very much appreciated!

#1

import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

bat_stats_url = "https://www.baseball-reference.com/teams/PHI/batteam.shtml"
data_b = requests.get(bat_stats_url)
soup = BeautifulSoup(data_b.text)
bat_stats_table = soup.select('table.stats_table')[0]

bat_year_stats = pd.read_html(data_b.text, match = 'Year-by-Year Team Batting')
bat_year_stats[0]

#2

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'https://www.baseball-reference.com/teams/PHI/batteam.shtml'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36', 'Referer': 'https://www.nseindia.com/'}
r = requests.get(url,  headers=headers)
soup = bs(r.content,'lxml')
table =soup.select('table')[-1]
rows = table.find_all('tr')
output = []
for row in rows:
    cols = row.find_all('td')
    cols = [item.text.strip() for item in cols]
    output.append([item for item in cols if item])
# Build the DataFrame from the scraped rows, then drop the empty first row
df = pd.DataFrame(output, columns = ['Year','Lg','W','L','Finish','R/G','G','PA','AB','R','H',
 '2B','3B','HR','RBI','SB','CS','BB','SO','BA','OBP','SLG','OPS','E','DP','Fld%'])
df = df.iloc[1:]
df

Upvotes: 0

Views: 135

Answers (1)

HedgeHog

Reputation: 25048

The table is stored inside an HTML comment, so pandas.read_html() cannot find it until you extract it. First collect the comment nodes:

soup.find_all(string=lambda text: isinstance(text, Comment))
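To illustrate why the second table is invisible to ordinary selectors, here is a minimal sketch on a small inline page. SAMPLE_HTML and the ids "visible" and "hidden" are made up for the illustration and are not taken from the real site; the point is only that markup inside a comment is not parsed into tags, while the Comment node still carries it as raw text.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: one live table, one inside a comment.
SAMPLE_HTML = """
<div><table id="visible"><tr><th>A</th></tr><tr><td>1</td></tr></table></div>
<div><!--
<table id="hidden"><tr><th>B</th></tr><tr><td>2</td></tr></table>
--></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only the live table is part of the parsed tree.
visible_ids = [t.get("id") for t in soup.find_all("table")]

# Comment nodes are string subclasses; their text is raw, unparsed HTML.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
hidden_html = [c for c in comments if "<table" in c]

print(visible_ids)
print(len(hidden_html))
```

This is presumably why selecting tables directly only ever reaches the first table on the page in question.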

Then pass the comment that contains the table to pandas:

pd.read_html([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="yby_team_bat_per_game"' in x][0])[0]

Example

import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup, Comment

# Pass an explicit parser so bs4 does not have to guess one
soup = BeautifulSoup(requests.get('https://www.baseball-reference.com/teams/PHI/batteam.shtml').text, 'html.parser')

# Find the comment that holds the per-game table and hand its HTML to pandas
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
table_html = [x for x in comments if 'id="yby_team_bat_per_game"' in x][0]
pd.read_html(StringIO(table_html))[0]

Output

   Year       Lg   W   L  Finish    G    PA     AB     R     H    2B    3B    HR   RBI    SB    CS    BB    SO     BA    OBP    SLG    OPS     E    DP   Fld%
0  2022  NL East  44  39       3   83  37.9  33.99  4.86  8.43  1.61  0.17  1.31  4.63  0.63  0.13  3.28  8.47  0.248  0.317  0.421  0.738  0.49  0.75  0.986
1  2021  NL East  82  80       2  162  37.6  33.12  4.53  7.95  1.62  0.15  1.22  4.32  0.48  0.12  3.48  8.65   0.24  0.318  0.408  0.726  0.58  0.88  0.984
2  2020  NL East  28  32       3   60  37.1  32.47   5.1  8.33   1.5  0.17  1.37  4.82  0.58  0.13  3.82     8  0.257  0.342  0.439  0.781  0.58  0.95  0.983
3  2019  NL East  81  81       4  162  38.6  34.39  4.78  8.45  1.92  0.16  1.33  4.58  0.48  0.11  3.47  8.97  0.246  0.319  0.427  0.746   0.6  0.84  0.984
4  2018  NL East  80  82       3  162  37.9  33.48  4.18  7.85  1.49  0.19  1.15  4.03  0.43  0.16  3.59  9.38  0.234  0.314  0.393  0.707  0.76  0.85  0.979

...
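If you need this for more than one table, the same idea can be wrapped in a small helper. This is only a sketch: read_commented_table is a hypothetical name, SAMPLE_PAGE is a tiny inline page with placeholder values so the code runs without network access, and it assumes the table id appears in the comment exactly as id="...". For the real page you would pass requests.get(url).text instead of SAMPLE_PAGE.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup, Comment


def read_commented_table(html: str, table_id: str) -> pd.DataFrame:
    """Return the table with the given id that is hidden inside an HTML comment."""
    soup = BeautifulSoup(html, "html.parser")
    marker = f'id="{table_id}"'
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if marker in comment:
            # The comment body is plain HTML, so pandas can parse it directly.
            return pd.read_html(StringIO(str(comment)), attrs={"id": table_id})[0]
    raise ValueError(f"no commented-out table with id {table_id!r} found")


# Tiny inline page with placeholder values (not real data from the site).
SAMPLE_PAGE = """
<table id="yby_team_bat"><tr><th>Year</th><th>R</th></tr><tr><td>2022</td><td>400</td></tr></table>
<!--
<table id="yby_team_bat_per_game">
  <thead><tr><th>Year</th><th>R</th></tr></thead>
  <tbody><tr><td>2022</td><td>4.86</td></tr></tbody>
</table>
-->
"""

df = read_commented_table(SAMPLE_PAGE, "yby_team_bat_per_game")
print(df)
```

Raising a ValueError when no comment matches gives a clearer message than the IndexError you get from indexing an empty list with [0].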

Upvotes: 2
