Reputation: 35

Parsing JSON strings from API with Pandas

I am trying to extract JSON keys and associated data nested in an API response into a Pandas DataFrame with each key and associated data element as a separate column.

I have tried the solution HERE: Parsing a JSON string which was loaded from a CSV using Pandas

But there are 2 problems. First, I have to convert the API request response into a CSV, and then back from a CSV to a DataFrame, which seems like a wasted step, but I was willing to do it if this would otherwise work.

Second, even when I do that, I get "JSON DecodeError: Expecting property name enclosed in double quotes."

I have also tried and failed with the solution described HERE: instructional at https://www.kaggle.com/jboysen/quick-tutorial-flatten-nested-json-in-pandas

import requests
from pandas import DataFrame
import pandas as pd
import json

teamgamedata_url = 'https://api.collegefootballdata.com/games/teams?year=2019&week=1&seasonType=regular'

teamgamedataresp = requests.get(teamgamedata_url)

dftg = DataFrame(teamgamedataresp.json())

This works, but produces a column 'teams' with lots of nested data so to try to follow the csv advice, I convert to CSV

dftg.to_csv(r'/path/teamgameinfoapi.csv')

def CustomParser(data):
    j1 = json.loads(data)
    return j1

csvtodf = pd.read_csv('/path/teamgameinfoapi.csv', 
                      converters={'teams':CustomParser}, header=0)

csvtodf[sorted(csvtodf['teams'][0].keys())] = csvtodf['teams'].apply(pd.Series)

I expected the CustomParser to extract the JSON data into separate columns, but get:

JSONDecodeError: Expecting property name enclosed in double quotes

I expected the last line of code to append the columns to the dataframe, but instead got:

KeyError: 'teams'

Upvotes: 0

Answers (3)

Vishnudev Krishnadas

Reputation: 10960

Efficient and more pandas way of doing this

On pandas >= 0.25.1

teamgamedataresp = requests.get(teamgamedata_url)

d = teamgamedataresp.json()

# errors='ignore' used because some records may not have id, in that case it will throw error, I'm ignoring it here
teams_df = pd.io.json.json_normalize(d, ['teams'], ['id'], errors='ignore')
print(teams_df)

teams_df = teams_df.explode('stats')
print(teams_df)

stats_df = pd.io.json.json_normalize(teams_df['stats'])
print(stats_df)

teams_df.drop(columns=['stats'], inplace=True)
data = pd.concat([teams_df.reset_index(drop=True), stats_df.reset_index(drop=True)], axis=1)
print(data)

TL;DR (Data displayed for understanding purposes)

Normalizing first level of records i.e. teams

        school         conference homeAway  points                                              stats         id
0       Vanderbilt                SEC     home       6  [{'category': 'rushingTDs', 'stat': '0'}, {'ca...  401110732
1          Georgia                SEC     away      30  [{'category': 'rushingTDs', 'stat': '2'}, {'ca...  401110732
2            Miami                ACC     home      20  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401110723
3          Florida                SEC     away      24  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401110723
4    Georgia State           Sun Belt     away      38  [{'category': 'rushingTDs', 'stat': '3'}, {'ca...  401110730
..             ...                ...      ...     ...                                                ...        ...
163           Navy  American Athletic     home      45  [{'category': 'rushingTDs', 'stat': '6'}, {'ca...  401117857
164   Gardner-Webb               None     away      28  [{'category': 'rushingTDs', 'stat': '3'}, {'ca...  401135910
165      Charlotte     Conference USA     home      49  [{'category': 'rushingTDs', 'stat': '4'}, {'ca...  401135910
166  Alabama State               None     away      19  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401114237
167            UAB     Conference USA     home      24  [{'category': 'rushingTDs', 'stat': '1'}, {'ca...  401114237

[168 rows x 6 columns]

Exploding list in stats column into rows

         school      conference homeAway  points                                           stats         id
0    Vanderbilt             SEC     home       6         {'category': 'rushingTDs', 'stat': '0'}  401110732
0    Vanderbilt             SEC     home       6         {'category': 'passingTDs', 'stat': '0'}  401110732
0    Vanderbilt             SEC     home       6   {'category': 'kickReturnYards', 'stat': '35'}  401110732
0    Vanderbilt             SEC     home       6      {'category': 'kickReturnTDs', 'stat': '0'}  401110732
0    Vanderbilt             SEC     home       6        {'category': 'kickReturns', 'stat': '2'}  401110732
..          ...             ...      ...     ...                                             ...        ...
167         UAB  Conference USA     home      24  {'category': 'netPassingYards', 'stat': '114'}  401114237
167         UAB  Conference USA     home      24       {'category': 'totalYards', 'stat': '290'}  401114237
167         UAB  Conference USA     home      24    {'category': 'fourthDownEff', 'stat': '0-1'}  401114237
167         UAB  Conference USA     home      24    {'category': 'thirdDownEff', 'stat': '1-13'}  401114237
167         UAB  Conference USA     home      24        {'category': 'firstDowns', 'stat': '16'}  401114237

[3927 rows x 6 columns]

Normalize the stats column to get a dataframe

             category  stat
0          rushingTDs     0
1          passingTDs     0
2     kickReturnYards    35
3       kickReturnTDs     0
4         kickReturns     2
...               ...   ...
3922  netPassingYards   114
3923       totalYards   290
3924    fourthDownEff   0-1
3925     thirdDownEff  1-13
3926       firstDowns    16

[3927 rows x 2 columns]

Finally, Merge both dataframes.

          school      conference homeAway  points         id         category  stat
0     Vanderbilt             SEC     home       6  401110732       rushingTDs     0
1     Vanderbilt             SEC     home       6  401110732       passingTDs     0
2     Vanderbilt             SEC     home       6  401110732  kickReturnYards    35
3     Vanderbilt             SEC     home       6  401110732    kickReturnTDs     0
4     Vanderbilt             SEC     home       6  401110732      kickReturns     2
...          ...             ...      ...     ...        ...              ...   ...
3922         UAB  Conference USA     home      24  401114237  netPassingYards   114
3923         UAB  Conference USA     home      24  401114237       totalYards   290
3924         UAB  Conference USA     home      24  401114237    fourthDownEff   0-1
3925         UAB  Conference USA     home      24  401114237     thirdDownEff  1-13
3926         UAB  Conference USA     home      24  401114237       firstDowns    16

[3927 rows x 7 columns]

Upvotes: 1

janalytics

Reputation: 35

I was unable to fit this into a comment; I apologize for breaking protocol. Hope I have followed you, @crayxt. Please let me know if I've misunderstood.

a = teamgamedataresp.json()
#defining a as the response to our get request for this data, in JSON format 
buf = []
#Not sure what this means, but I think we're creating a bucket where a new dict can go
for game in a:
    #defining game as the key unit for evaluation in our desired dataset
    for team in game['teams']:
        #telling python that for each unique game we want to identify each
        #team identified by the 'teams' key, and  . . . 
        game_dict = dict(id=game['id'])
        #tells python that our key unit, game, is tied to the game ID#
        for cat in ('school', 'conference', 'homeAway', 'points'):
            #defining the categories where there is no further nesting to explore
            game_dict[cat]=team[cat]
            #we're making sure that python knows that the categories used for the game
            #should be the same one used for the team.
        for stat in team['stats']:
            game_dict[stat['category']] = stat['stat']
            #now we need to dive into the further nesting in the stats category.
            #each team's statistics should be labeled with the JSON 'category" key
            #and given a value equal to the "stat" value key given in JSON
        buf.append(game_dict)
        #adds the newly created stat columns to the dataset

Upvotes: 0

crayxt

Reputation: 2405

I would rather drill down manually, as the structure is complex:

a = teamgamedataresp.json()
buf = []
for game in a:
    for team in game['teams']:
        game_dict = dict(id=game['id'])
        for cat in ('school', 'conference', 'homeAway', 'points'):
            game_dict[cat] = team[cat]
        for stat in team['stats']:
            game_dict[stat['category']] = stat['stat']
        buf.append(game_dict)

>>> df = pd.DataFrame(buf)
>>> df
            id         school         conference homeAway  points  ... puntReturnTDs puntReturns interceptionYards interceptionTDs passesIntercepted
0    401110732     Vanderbilt                SEC     home       6  ...           NaN         NaN               NaN             NaN               NaN
1    401110732        Georgia                SEC     away      30  ...             0           4               NaN             NaN               NaN
2    401110723          Miami                ACC     home      20  ...             0           1                41               0                 2
3    401110723        Florida                SEC     away      24  ...             0           3               NaN             NaN               NaN
4    401110730  Georgia State           Sun Belt     away      38  ...           NaN         NaN                 0               0                 1
..         ...            ...                ...      ...     ...  ...           ...         ...               ...             ...               ...
163  401117857           Navy  American Athletic     home      45  ...             0           1               NaN             NaN               NaN
164  401135910   Gardner-Webb               None     away      28  ...           NaN         NaN                45               0                 3
165  401135910      Charlotte     Conference USA     home      49  ...             1           3               NaN             NaN               NaN
166  401114237  Alabama State               None     away      19  ...             0           2               NaN             NaN               NaN
167  401114237            UAB     Conference USA     home      24  ...             0           2                 0               0                 1

[168 rows x 31 columns]

Upvotes: 0

Parsing JSON strings from API with Pandas

Answers (3)

TL;DR (Data displayed for understanding purposes)

Related Questions