Reputation: 35
I am trying to extract JSON keys and associated data nested in an API response into a Pandas DataFrame with each key and associated data element as a separate column.
I have tried the solution HERE: Parsing a JSON string which was loaded from a CSV using Pandas
But there are two problems. First, I have to convert the API response into a CSV and then read the CSV back into a DataFrame, which seems like a wasted round trip, but I was willing to do it if this would otherwise work.
Second, even when I do that, I get "JSONDecodeError: Expecting property name enclosed in double quotes."
I have also tried and failed with the tutorial HERE: https://www.kaggle.com/jboysen/quick-tutorial-flatten-nested-json-in-pandas
import requests
from pandas import DataFrame
import pandas as pd
import json
teamgamedata_url = 'https://api.collegefootballdata.com/games/teams?year=2019&week=1&seasonType=regular'
teamgamedataresp = requests.get(teamgamedata_url)
dftg = DataFrame(teamgamedataresp.json())
This works, but produces a column 'teams' full of nested data, so to follow the CSV advice I convert to a CSV:
dftg.to_csv(r'/path/teamgameinfoapi.csv')
def CustomParser(data):
    j1 = json.loads(data)
    return j1

csvtodf = pd.read_csv('/path/teamgameinfoapi.csv',
                      converters={'teams': CustomParser}, header=0)
csvtodf[sorted(csvtodf['teams'][0].keys())] = csvtodf['teams'].apply(pd.Series)
I expected the CustomParser to extract the JSON data into separate columns, but get:
JSONDecodeError: Expecting property name enclosed in double quotes
I expected the last line of code to append the columns to the dataframe, but instead got:
KeyError: 'teams'
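For context on the decode error, here is a minimal sketch of what likely goes wrong, using a single cell value shaped like the stats entries in this API (the dict literal is illustrative): writing a column of dicts to CSV stores each cell as Python's repr(), with single quotes, which is not valid JSON, so json.loads fails; ast.literal_eval parses Python literals and handles that form.

```python
import ast
import json

# A cell read back from the CSV holds the Python repr of a dict
# (single quotes), which is not valid JSON.
cell = "{'category': 'rushingTDs', 'stat': '0'}"

try:
    json.loads(cell)
except json.JSONDecodeError as err:
    print(err)  # Expecting property name enclosed in double quotes

# ast.literal_eval parses Python literals, so it accepts the repr form.
parsed = ast.literal_eval(cell)
print(parsed['category'])
```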
Upvotes: 0
Views: 1040
Reputation: 10960
An efficient and more pandas-native way of doing this.
On pandas >= 0.25.1
teamgamedataresp = requests.get(teamgamedata_url)
d = teamgamedataresp.json()
# errors='ignore' is used because some records may not have an 'id';
# without it, json_normalize would raise an error for those records
teams_df = pd.io.json.json_normalize(d, ['teams'], ['id'], errors='ignore')
print(teams_df)
teams_df = teams_df.explode('stats')
print(teams_df)
stats_df = pd.io.json.json_normalize(teams_df['stats'])
print(stats_df)
teams_df.drop(columns=['stats'], inplace=True)
data = pd.concat([teams_df.reset_index(drop=True), stats_df.reset_index(drop=True)], axis=1)
print(data)
Normalizing the first level of records, i.e. teams:
school conference homeAway points stats id
0 Vanderbilt SEC home 6 [{'category': 'rushingTDs', 'stat': '0'}, {'ca... 401110732
1 Georgia SEC away 30 [{'category': 'rushingTDs', 'stat': '2'}, {'ca... 401110732
2 Miami ACC home 20 [{'category': 'rushingTDs', 'stat': '1'}, {'ca... 401110723
3 Florida SEC away 24 [{'category': 'rushingTDs', 'stat': '1'}, {'ca... 401110723
4 Georgia State Sun Belt away 38 [{'category': 'rushingTDs', 'stat': '3'}, {'ca... 401110730
.. ... ... ... ... ... ...
163 Navy American Athletic home 45 [{'category': 'rushingTDs', 'stat': '6'}, {'ca... 401117857
164 Gardner-Webb None away 28 [{'category': 'rushingTDs', 'stat': '3'}, {'ca... 401135910
165 Charlotte Conference USA home 49 [{'category': 'rushingTDs', 'stat': '4'}, {'ca... 401135910
166 Alabama State None away 19 [{'category': 'rushingTDs', 'stat': '1'}, {'ca... 401114237
167 UAB Conference USA home 24 [{'category': 'rushingTDs', 'stat': '1'}, {'ca... 401114237
[168 rows x 6 columns]
Exploding the list in the stats column into rows:
school conference homeAway points stats id
0 Vanderbilt SEC home 6 {'category': 'rushingTDs', 'stat': '0'} 401110732
0 Vanderbilt SEC home 6 {'category': 'passingTDs', 'stat': '0'} 401110732
0 Vanderbilt SEC home 6 {'category': 'kickReturnYards', 'stat': '35'} 401110732
0 Vanderbilt SEC home 6 {'category': 'kickReturnTDs', 'stat': '0'} 401110732
0 Vanderbilt SEC home 6 {'category': 'kickReturns', 'stat': '2'} 401110732
.. ... ... ... ... ... ...
167 UAB Conference USA home 24 {'category': 'netPassingYards', 'stat': '114'} 401114237
167 UAB Conference USA home 24 {'category': 'totalYards', 'stat': '290'} 401114237
167 UAB Conference USA home 24 {'category': 'fourthDownEff', 'stat': '0-1'} 401114237
167 UAB Conference USA home 24 {'category': 'thirdDownEff', 'stat': '1-13'} 401114237
167 UAB Conference USA home 24 {'category': 'firstDowns', 'stat': '16'} 401114237
[3927 rows x 6 columns]
Normalizing the stats column to get a DataFrame:
category stat
0 rushingTDs 0
1 passingTDs 0
2 kickReturnYards 35
3 kickReturnTDs 0
4 kickReturns 2
... ... ...
3922 netPassingYards 114
3923 totalYards 290
3924 fourthDownEff 0-1
3925 thirdDownEff 1-13
3926 firstDowns 16
[3927 rows x 2 columns]
Finally, concatenate both DataFrames:
school conference homeAway points id category stat
0 Vanderbilt SEC home 6 401110732 rushingTDs 0
1 Vanderbilt SEC home 6 401110732 passingTDs 0
2 Vanderbilt SEC home 6 401110732 kickReturnYards 35
3 Vanderbilt SEC home 6 401110732 kickReturnTDs 0
4 Vanderbilt SEC home 6 401110732 kickReturns 2
... ... ... ... ... ... ... ...
3922 UAB Conference USA home 24 401114237 netPassingYards 114
3923 UAB Conference USA home 24 401114237 totalYards 290
3924 UAB Conference USA home 24 401114237 fourthDownEff 0-1
3925 UAB Conference USA home 24 401114237 thirdDownEff 1-13
3926 UAB Conference USA home 24 401114237 firstDowns 16
[3927 rows x 7 columns]
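On pandas >= 1.0, the same steps work with the top-level pd.json_normalize. Here is a minimal offline sketch of the pipeline above, assuming one hand-built record shaped like a game from this API (values copied from the sample output, no network call needed):

```python
import pandas as pd

# One record mirroring the API's shape: a game with an id and a list of
# teams, each team carrying a list of {'category', 'stat'} dicts.
d = [{
    "id": 401110732,
    "teams": [
        {"school": "Vanderbilt", "conference": "SEC", "homeAway": "home",
         "points": 6,
         "stats": [{"category": "rushingTDs", "stat": "0"},
                   {"category": "passingTDs", "stat": "0"}]},
    ],
}]

# One row per team, with the game id carried along as metadata.
teams_df = pd.json_normalize(d, ["teams"], ["id"])

# One row per stat entry.
teams_df = teams_df.explode("stats")

# Flatten the stat dicts into their own columns.
stats_df = pd.json_normalize(teams_df["stats"].tolist())

# Side-by-side concat after resetting both indexes.
out = pd.concat(
    [teams_df.drop(columns=["stats"]).reset_index(drop=True),
     stats_df.reset_index(drop=True)],
    axis=1,
)
print(out)
```

This yields the same column layout as the final table above: school, conference, homeAway, points, id, category, stat.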
Upvotes: 1
Reputation: 35
I was unable to fit this into a comment; I apologize for breaking protocol. Hope I have followed you, @crayxt. Please let me know if I've misunderstood.
a = teamgamedataresp.json()
# defining a as the response to our GET request for this data, in JSON format
buf = []
# I think this creates an empty list that will collect one dict per team
for game in a:
    # defining game as the key unit for evaluation in our desired dataset
    for team in game['teams']:
        # telling Python that for each unique game we want to identify each
        # team listed under the 'teams' key, and . . .
        game_dict = dict(id=game['id'])
        # starts a fresh dict for this team, tied to the game ID
        for cat in ('school', 'conference', 'homeAway', 'points'):
            # the categories where there is no further nesting to explore
            game_dict[cat] = team[cat]
            # copies those category values straight from the team record
        for stat in team['stats']:
            game_dict[stat['category']] = stat['stat']
            # now we dive into the further nesting in the stats category:
            # each team's statistic is labeled with the JSON 'category' key
            # and given the value of the 'stat' key
        buf.append(game_dict)
        # adds the completed dict (one row per team) to the list
Upvotes: 0
Reputation: 2405
I would rather drill down manually, as the structure is complex:
a = teamgamedataresp.json()
buf = []
for game in a:
    for team in game['teams']:
        game_dict = dict(id=game['id'])
        for cat in ('school', 'conference', 'homeAway', 'points'):
            game_dict[cat] = team[cat]
        for stat in team['stats']:
            game_dict[stat['category']] = stat['stat']
        buf.append(game_dict)
>>> df = pd.DataFrame(buf)
>>> df
id school conference homeAway points ... puntReturnTDs puntReturns interceptionYards interceptionTDs passesIntercepted
0 401110732 Vanderbilt SEC home 6 ... NaN NaN NaN NaN NaN
1 401110732 Georgia SEC away 30 ... 0 4 NaN NaN NaN
2 401110723 Miami ACC home 20 ... 0 1 41 0 2
3 401110723 Florida SEC away 24 ... 0 3 NaN NaN NaN
4 401110730 Georgia State Sun Belt away 38 ... NaN NaN 0 0 1
.. ... ... ... ... ... ... ... ... ... ... ...
163 401117857 Navy American Athletic home 45 ... 0 1 NaN NaN NaN
164 401135910 Gardner-Webb None away 28 ... NaN NaN 45 0 3
165 401135910 Charlotte Conference USA home 49 ... 1 3 NaN NaN NaN
166 401114237 Alabama State None away 19 ... 0 2 NaN NaN NaN
167 401114237 UAB Conference USA home 24 ... 0 2 0 0 1
[168 rows x 31 columns]
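The drill-down loop above can be exercised without the API call by feeding it a hand-built record shaped like the response (values copied from the sample output; only one stat per team to keep it short):

```python
import pandas as pd

# One game with two teams, mirroring the API response structure.
a = [{
    "id": 401110732,
    "teams": [
        {"school": "Vanderbilt", "conference": "SEC", "homeAway": "home",
         "points": 6,
         "stats": [{"category": "rushingTDs", "stat": "0"}]},
        {"school": "Georgia", "conference": "SEC", "homeAway": "away",
         "points": 30,
         "stats": [{"category": "rushingTDs", "stat": "2"}]},
    ],
}]

buf = []
for game in a:
    for team in game['teams']:
        game_dict = dict(id=game['id'])
        for cat in ('school', 'conference', 'homeAway', 'points'):
            game_dict[cat] = team[cat]
        for stat in team['stats']:
            game_dict[stat['category']] = stat['stat']
        buf.append(game_dict)

df = pd.DataFrame(buf)
print(df)
```

Each team becomes one flat dict, so the DataFrame gets one row per team and one column per stat category; teams missing a category simply get NaN there, which is why the full result above has NaN cells.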
Upvotes: 0