Reputation: 29
I am looking to build a dataframe utilizing certain data from a JSON with multiple nested dictionaries.
An example of the format of the JSON is as follows:
"leagueYear": "2021",
"stats": {
"Cincinnati Bengals": {
"offense": {
"firstDownPerG": {
"value": "19.82",
"rank": "18"
},
"thirdDownPct": {
"rank": "16",
"value": "39.6%"
},
"g": {
"value": "17"
}
}
"defense": {
"tacklesLoss": {
"rank": "9",
"value": "80"
},
"passDefended": {
"rank": "24",
"value": "63"
}
}
"Carolina Panthers": {
"offense": {
"fumbles": {
"value": "20",
"rank": "13"
},
"passIntPerc": {
"value": "3.5",
"rank": "31"
}
}
}
I am able to format to dataframe with the following code:
with open('2021.json') as json_file:
team_dict = json.load(json_file)
print(team_dict.keys())
print(team_dict['leagueYear'])
teams = team_dict['stats']
df = pd.DataFrame()
df['Teams'] = teams.keys()
To get a dataframe that looks like:
Teams | |
---|---|
0 | Cincinati Bengals |
1 | Buffalo Bills |
What I need to do is write a function that will pull selected keys based on team names as the index value.
So for the games value I can use...
print(teams['Buffalo Bills']['offense']['g']['value'])
...to get the value of how many games the Bills played, however I would like to write a function that would look at the value of the team name and pull the corresponding values. I need to pull about 20 values from each team, so having a way to iterate over the df rows as a function would be very useful.
My end goal is to write a function(s) that make my df look like:
Teams | Games Played | First Down Per Game | |
---|---|---|---|
0 | Cincinati Bengals | 17 | 19.82 |
1 | Buffalo Bills | 17 | Whatever it is |
I'm sure this is a relatively easy ask, but I am very new to using JSON and building dataframes.
My closest guess, I don't think the function or the df.apply is right:
def games_played(game):
team = df['Team']
games_played = game[team]['offense']['g']['value']
return games_played
df['Games Played'] = df.apply(lambda games_played: ) <--- getting lost here
Upvotes: 0
Views: 460
Reputation: 13242
You should probably look at pd.json_normalize
before trying to reinvent the wheel:
data = {'leagueYear': '2021', 'stats': {'Cincinnati Bengals': {'offense': {'firstDownPerG': {'value': '19.82', 'rank': '18'}, 'thirdDownPct': {'rank': '16', 'value': '39.6%'}, 'g': {'value': '17'}}}}}
df = pd.json_normalize(data)
print(df)
Output:
leagueYear stats.Cincinnati Bengals.offense.firstDownPerG.value stats.Cincinnati Bengals.offense.firstDownPerG.rank stats.Cincinnati Bengals.offense.thirdDownPct.rank stats.Cincinnati Bengals.offense.thirdDownPct.value stats.Cincinnati Bengals.offense.g.value
0 2021 19.82 18 16 39.6% 1
Given (I had to clean up what you pasted a little to be valid JSON):
data = {
"leagueYear": "2021",
"stats": {
"Cincinnati Bengals": {
"offense": {
"firstDownPerG": {
"value": "19.82",
"rank": "18"
},
"thirdDownPct": {
"rank": "16",
"value": "39.6%"
},
"g": {
"value": "17"
}
},
"defense": {
"tacklesLoss": {
"rank": "9",
"value": "80"
},
"passDefended": {
"rank": "24",
"value": "63"
}
}
},
"Carolina Panthers": {
"offense": {
"fumbles": {
"value": "20",
"rank": "13"
},
"passIntPerc": {
"value": "3.5",
"rank": "31"
}
}
}
}
}
Doing:
# Stats will work better if it's a list, and formatted a little differently:
data['stats'] = [{'team': k} | v for k,v in data['stats'].items()]
# This makes data look like:
{
"leagueYear": "2021",
"stats": [
{
"team": "Cincinnati Bengals",
"offense": {
"firstDownPerG": {
"value": "19.82",
"rank": "18"
},
"thirdDownPct": {
"rank": "16",
"value": "39.6%"
},
"g": {
"value": "17"
}
},
"defense": {
"tacklesLoss": {
"rank": "9",
"value": "80"
},
"passDefended": {
"rank": "24",
"value": "63"
}
}
},
{
"team": "Carolina Panthers",
"offense": {
"fumbles": {
"value": "20",
"rank": "13"
},
"passIntPerc": {
"value": "3.5",
"rank": "31"
}
}
}
]
}
Now, if we use pd.json_normalize
...
df = pd.json_normalize(data, ['stats'])
print(df)
# Output:
team offense.firstDownPerG.value offense.firstDownPerG.rank offense.thirdDownPct.rank offense.thirdDownPct.value offense.g.value defense.tacklesLoss.rank defense.tacklesLoss.value defense.passDefended.rank defense.passDefended.value offense.fumbles.value offense.fumbles.rank offense.passIntPerc.value offense.passIntPerc.rank
0 Cincinnati Bengals 19.82 18 16 39.6% 17 9 80 24 63 NaN NaN NaN NaN
1 Carolina Panthers NaN NaN NaN NaN NaN NaN NaN NaN NaN 20 13 3.5 31
We can do some nifty things to clean this up, one method might be:
df = df.set_index('team')
# Make a multi-index
df.columns = df.columns.str.split('.', expand=True)
df = df.sort_index(axis=1)
print(df)
# Output:
defense offense
passDefended tacklesLoss firstDownPerG fumbles g passIntPerc thirdDownPct
rank value rank value rank value rank value value rank value rank value
team
Cincinnati Bengals 24 63 9 80 18 19.82 NaN NaN 17 NaN NaN 16 39.6%
Carolina Panthers NaN NaN NaN NaN NaN NaN 13 20 NaN 31 3.5 NaN NaN
Now you have a nifty dataframe that you can do many things with:
# For example, Perhaps you want just a dataframe with defense ranks:
defense_rank = df.stack([-3, -1]).xs(('defense', 'rank'), axis=0, level=[-2, -1]).dropna(axis=1)
print(defense_rank)
# Output:
passDefended tacklesLoss
team
Cincinnati Bengals 24 9
Upvotes: 1