Reputation: 1862
I found Greg Reda's blog post about scraping HTML from nba.com:
http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
I tried to work with the code he wrote there:
import requests
import json
url = 'http://stats.nba.com/stats/leaguedashteamshotlocations?Conference=&DateFr' + \
'om=&DateTo=&DistanceRange=By+Zone&Division=&GameScope=&GameSegment=&LastN' + \
'Games=0&LeagueID=00&Location=&MeasureType=Opponent&Month=0&OpponentTeamID' + \
'=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperien' + \
'ce=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2014-15&SeasonSegment=&Seas' + \
'onType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision='
response = requests.get(url)
response.raise_for_status()
shots = response.json()['resultSets']['rowSet']
avg_percentage = shots['OPP_FG_PCT']
print(avg_percentage)
But it returns:
Traceback (most recent call last):
File "C:\Python34\nba.py", line 91, in <module>
avg_percentage = shots['OPP_FG_PCT']
TypeError: list indices must be integers, not str
I know only basic Python, so I couldn't figure out how to get the values out of this data structure. Can anybody explain?
Upvotes: 2
Views: 2213
Reputation: 12239
Evidently the data structure has changed since Greg Reda wrote that post. Before exploring the data, I recommend that you save it to a file via pickling. That way you don't have to keep hitting the NBA server and waiting for a download each time you modify and rerun the script.
The following script checks for the existence of the pickled data to avoid unnecessary downloading:
import requests
import json
url = 'http://stats.nba.com/stats/leaguedashteamshotlocations?Conference=&DateFr' + \
'om=&DateTo=&DistanceRange=By+Zone&Division=&GameScope=&GameSegment=&LastN' + \
'Games=0&LeagueID=00&Location=&MeasureType=Opponent&Month=0&OpponentTeamID' + \
'=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperien' + \
'ce=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2014-15&SeasonSegment=&Seas' + \
'onType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision='
print(url)
import sys, os, pickle
file_name = 'result_sets.pickled'
if os.path.isfile(file_name):
    result_sets = pickle.load(open(file_name, 'rb'))
else:
    response = requests.get(url)
    response.raise_for_status()
    result_sets = response.json()['resultSets']
    pickle.dump(result_sets, open(file_name, 'wb'))
print(result_sets.keys())
print(result_sets['headers'][1])
print(result_sets['rowSet'][0])
print(len(result_sets['rowSet']))
Once you have result_sets in hand, you can examine the data. If you print it, you'll see that it's a dictionary. You can extract the dictionary keys:
print(result_sets.keys())
Currently the keys are 'headers', 'rowSet', and 'name'. You can inspect the headers:
print(result_sets['headers'])
I probably know less about these statistics than you do. However, by looking at the data, I've been able to figure out that result_sets['rowSet'] contains 30 rows of 23 elements each. The 23 columns are identified by result_sets['headers'][1]. Try this:
print(result_sets['headers'][1])
That will show you the 23 column names. Now take a look at the first row of team data:
print(result_sets['rowSet'][0])
Now you see the 23 values reported for the Atlanta Hawks. You can iterate over the rows in result_sets['rowSet'] to extract whatever values interest you and to compute aggregate information such as totals and averages.
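For example, here's a sketch of how you could compute the league-wide average opponent field-goal percentage. The sample result_sets below is made up for illustration (the real response has 30 rows of 23 elements); the idea is to look up the position of the 'OPP_FG_PCT' column in the header list and pull that index out of each row:

```python
# Made-up sample mirroring the structure described above; the real
# result_sets['headers'][1] holds the 23 column names and
# result_sets['rowSet'] holds one 23-element row per team.
result_sets = {
    'headers': [['SHOT_CATEGORY'],
                ['TEAM_ID', 'TEAM_NAME', 'OPP_FG_PCT']],
    'rowSet': [[1610612737, 'Atlanta Hawks', 0.461],
               [1610612738, 'Boston Celtics', 0.449]],
}

columns = result_sets['headers'][1]
pct_index = columns.index('OPP_FG_PCT')   # position of the column we want

# Pull that column out of every row, then average it.
percentages = [row[pct_index] for row in result_sets['rowSet']]
avg_percentage = sum(percentages) / len(percentages)
print(avg_percentage)
```

If 'OPP_FG_PCT' doesn't appear in result_sets['headers'][1] on the live data, print the headers first (as shown above) and adjust the column name to match.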
Upvotes: 4