Reputation: 57
I am attempting to scrape the following page using Python (currently with Requests & BeautifulSoup) but am struggling to a) obtain meaningful results in a tabular format and b) scrape every page, since most players' data spans multiple pages (e.g., the following player's data spans 7 pages: https://www.nba.com/stats/player/203081/head-to-head/ )
So far I've been able to make a successful GET request and parse the response with BeautifulSoup, but I'm unsure of the best way to proceed. Any help/suggestions/recommendations are greatly appreciated.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.nba.com/stats/player/203081/head-to-head/'
r = requests.get(url)
if r.status_code == 200:
    soup = BeautifulSoup(r.content, 'html.parser')
    print(soup)
    table = soup.find('table')
    if table:
        df = pd.read_html(str(table))[0]
        print(df)
Upvotes: 0
Views: 237
Reputation: 10809
I visited the page in my browser and logged my network traffic, and saw that the browser made several HTTP GET requests to REST APIs. One of them hits the endpoint stats/leagueseasonmatchups, which you can query with a specific player, league and season. The response is JSON containing all the table information you're trying to scrape, and it arrives in a single response, so there's no need to page through the table the way the browser UI does. Normally, this API is used by the page to populate the DOM asynchronously using JavaScript. Since we know the endpoint, query-string parameters and request headers, we can imitate that HTTP GET request, parse the response, and write it to a CSV:
def get_matchups():
    import requests

    url = "https://stats.nba.com/stats/leagueseasonmatchups"
    params = {
        "DateFrom": "",
        "DateTo": "",
        "DefPlayerID": "203081",
        "LeagueID": "00",
        "Outcome": "",
        "PORound": "0",
        "PerMode": "Totals",
        "Season": "2020-21",
        "SeasonType": "Regular Season"
    }
    headers = {
        "Accept": "application/json",
        "Accept-Encoding": "gzip, deflate",
        "Referer": "https://www.nba.com/",
        "User-Agent": "Mozilla/5.0",
        "x-nba-stats-origin": "stats",
        "x-nba-stats-token": "true"
    }

    print("Getting matchups for player ID# {}...".format(params["DefPlayerID"]))
    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    data = response.json()
    fieldnames = data["resultSets"][0]["headers"]

    # Yield each row as a dict keyed by the column headers
    for row in data["resultSets"][0]["rowSet"]:
        yield dict(zip(fieldnames, row))

def main():
    from csv import DictWriter

    all_matchups = list(get_matchups())

    print("Writing to CSV file...")
    with open("output.csv", "w", newline="") as file:
        fieldnames = list(all_matchups[0])  # a bit lame
        writer = DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for matchup in all_matchups:
            writer.writerow(matchup)
    print("Done.")
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output (Terminal):
Getting matchups for player ID# 203081...
Writing to CSV file...
Done.
Output (CSV):
SEASON_ID,OFF_PLAYER_ID,OFF_PLAYER_NAME,DEF_PLAYER_ID,DEF_PLAYER_NAME,GP,MATCHUP_MIN,PARTIAL_POSS,PLAYER_PTS,TEAM_PTS,MATCHUP_AST,MATCHUP_TOV,MATCHUP_BLK,MATCHUP_FGM,MATCHUP_FGA,MATCHUP_FG_PCT,MATCHUP_FG3M,MATCHUP_FG3A,MATCHUP_FG3_PCT,HELP_BLK,HELP_FGM,HELP_FGA,HELP_FG_PERC,MATCHUP_FTM,MATCHUP_FTA,SFL
22020,202709,Cory Joseph,203081,Damian Lillard,5,17:34,68.6,4,82,1,1,0,2,10,0.2,0,3,0.0,0,0,0,0.0,0,0,0
22020,1628969,Mikal Bridges,203081,Damian Lillard,3,17:28,68.36,18,98,4,1,0,7,8,0.875,3,4,0.75,0,0,0,0.0,1,1,1
22020,1628366,Lonzo Ball,203081,Damian Lillard,3,16:34,65.98,17,77,6,2,1,6,13,0.462,5,11,0.455,0,0,0,0.0,0,0,0
22020,1626220,Royce O'Neale,203081,Damian Lillard,3,14:17,51.4,2,77,0,1,0,1,6,0.167,0,4,0.0,0,0,0,0.0,0,0,0
22020,1626196,Josh Richardson,203081,Damian Lillard,3,11:39,47.9,6,80,2,1,0,2,4,0.5,1,1,1.0,0,0,0,0.0,1,1,1
...
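Since your original goal was a tabular format, the same `resultSets` payload can be loaded straight into a pandas DataFrame instead of (or in addition to) writing a CSV: `headers` become the column names and `rowSet` the rows. A minimal sketch, using a tiny hand-made payload in the same shape as the API response (the values below are illustrative, not real API output):

```python
import pandas as pd

def matchups_to_dataframe(data):
    # "headers" are the column names, "rowSet" the data rows
    result = data["resultSets"][0]
    return pd.DataFrame(result["rowSet"], columns=result["headers"])

# Hand-made payload mimicking the API response shape (illustrative values)
sample = {
    "resultSets": [{
        "headers": ["OFF_PLAYER_NAME", "GP", "PLAYER_PTS"],
        "rowSet": [
            ["Cory Joseph", 5, 4],
            ["Mikal Bridges", 3, 18],
        ],
    }]
}

df = matchups_to_dataframe(sample)
print(df)
```

In the real script you would pass `response.json()` to `matchups_to_dataframe` instead of `sample`.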
Upvotes: 1