Reputation: 9
I am trying to scrape the second table (Year-by-Year Team Batting per Game) on this webpage, but I have only been able to scrape the first table (Year-by-Year Team Batting). I have researched a couple of different ways to scrape it using BeautifulSoup, but have not been successful in getting the table. The code for the two methods I have tried is below. Any help, thoughts, or ideas would be very much appreciated!
#1
import requests
bat_stats_url = "https://www.baseball-reference.com/teams/PHI/batteam.shtml"
data_b = requests.get(bat_stats_url)
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(data_b.text, 'html.parser')
bat_stats_table = soup.select('table.stats_table')[0]
import pandas as pd
bat_year_stats = pd.read_html(data_b.text, match = 'Year-by-Year Team Batting')
bat_year_stats[0]
#2
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'https://www.baseball-reference.com/teams/PHI/batteam.shtml'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36', 'Referer': 'https://www.nseindia.com/'}
r = requests.get(url, headers=headers)
soup = bs(r.content,'lxml')
table =soup.select('table')[-1]
rows = table.find_all('tr')
output = []
for row in rows:
    cols = row.find_all('td')
    cols = [item.text.strip() for item in cols]
    output.append([item for item in cols if item])
bat_year_stats[0].columns.values.tolist()
df = pd.DataFrame(output, columns = ['Year','Lg','W','L','Finish','R/G','G','PA','AB','R','H',
'2B','3B','HR','RBI','SB','CS','BB','SO','BA','OBP','SLG','OPS','E','DP','Fld%'])
df = df.iloc[1:]
df
Upvotes: 0
Views: 135
Reputation: 25048
The table is stored inside an HTML comment, so pandas.read_html() cannot find it until you extract the comment:
soup.find_all(string=lambda text: isinstance(text, Comment))
Then pick the comment that contains the table and pass it to pandas.read_html():
pd.read_html([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="yby_team_bat_per_game"' in x][0])[0]
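from io import StringIO
import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment
url = 'https://www.baseball-reference.com/teams/PHI/batteam.shtml'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# Collect all HTML comments, then keep the one that holds the per-game table
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
table_html = [c for c in comments if 'id="yby_team_bat_per_game"' in c][0]
pd.read_html(StringIO(table_html))[0]  # StringIO avoids the literal-HTML deprecation warning in newer pandas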
|   | Year | Lg | W | L | Finish | G | PA | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | BA | OBP | SLG | OPS | E | DP | Fld% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | NL East | 44 | 39 | 3 | 83 | 37.9 | 33.99 | 4.86 | 8.43 | 1.61 | 0.17 | 1.31 | 4.63 | 0.63 | 0.13 | 3.28 | 8.47 | 0.248 | 0.317 | 0.421 | 0.738 | 0.49 | 0.75 | 0.986 |
| 1 | 2021 | NL East | 82 | 80 | 2 | 162 | 37.6 | 33.12 | 4.53 | 7.95 | 1.62 | 0.15 | 1.22 | 4.32 | 0.48 | 0.12 | 3.48 | 8.65 | 0.24 | 0.318 | 0.408 | 0.726 | 0.58 | 0.88 | 0.984 |
| 2 | 2020 | NL East | 28 | 32 | 3 | 60 | 37.1 | 32.47 | 5.1 | 8.33 | 1.5 | 0.17 | 1.37 | 4.82 | 0.58 | 0.13 | 3.82 | 8 | 0.257 | 0.342 | 0.439 | 0.781 | 0.58 | 0.95 | 0.983 |
| 3 | 2019 | NL East | 81 | 81 | 4 | 162 | 38.6 | 34.39 | 4.78 | 8.45 | 1.92 | 0.16 | 1.33 | 4.58 | 0.48 | 0.11 | 3.47 | 8.97 | 0.246 | 0.319 | 0.427 | 0.746 | 0.6 | 0.84 | 0.984 |
| 4 | 2018 | NL East | 80 | 82 | 3 | 162 | 37.9 | 33.48 | 4.18 | 7.85 | 1.49 | 0.19 | 1.15 | 4.03 | 0.43 | 0.16 | 3.59 | 9.38 | 0.234 | 0.314 | 0.393 | 0.707 | 0.76 | 0.85 | 0.979 |
...
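If you want to check the "table hidden in a comment" idea without hitting the site, here is a minimal sketch that uses only the standard library's html.parser instead of BeautifulSoup. The class CommentCollector, the function find_commented_table, and the miniature sample_page are all hypothetical names made up for illustration; only the table id yby_team_bat_per_game comes from the answer above. It is a swapped-in alternative for locating the comment, not the method used in the answer.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the body of every HTML comment in a document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # `data` is the comment text without the <!-- and --> markers.
        self.comments.append(data)


def find_commented_table(html, table_id):
    """Return the first comment body containing id="<table_id>", or None."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    needle = f'id="{table_id}"'
    for comment in collector.comments:
        if needle in comment:
            return comment
    return None


# Hypothetical miniature page: one visible table, one table inside a comment.
sample_page = """
<html><body>
<table id="visible"><tr><td>1</td></tr></table>
<!--
<table id="yby_team_bat_per_game"><tr><th>Year</th></tr><tr><td>2022</td></tr></table>
-->
</body></html>
"""

hidden_html = find_commented_table(sample_page, "yby_team_bat_per_game")
missing_html = find_commented_table(sample_page, "no_such_table")
```

Here hidden_html holds the raw table markup from inside the comment, which can then be handed to pd.read_html() in the same way as the extracted comment in the answer; missing_html is None because no comment mentions that id.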
Upvotes: 2