Reputation: 1048
How can I add a new column to a dataframe with the sum of values from a nested dataframe, without losing any of the other columns and nested data using pandas?
Specifically, I want to create a new column total_cost
with the sum of all nested dataframes for a row.
I managed to create the following dataframe using a series of groupby
and apply
:
user_id description unit_summary
0 111 xxx [{'total_period_cost': 100, 'unit_id': 'xxx', ...
1 222 xxx [{'total_period_cost': 100, 'unit_id': 'yyy', ...
I'm trying to add the column total_cost
which is the sum of total_period_cost
of each nested dataframe (grouped by user_id
). How can I achieve the following dataframe?
user_id description total_cost unit_summary
0 111 xxx 300 [{'total_period_cost': 100, 'unit_id': 'xxx', ...
1 222 xxx 100 [{'total_period_cost': 100, 'unit_id': 'yyy', ...
import pandas as pd
series = [{
"user_id":"111",
"description": "xxx",
"unit_summary":[
{
"total_period_cost":100,
"unit_id":"xxx",
"cost_per_unit":50,
"total_period_usage":2
},
{
"total_period_cost":200,
"unit_id":"yyy",
"cost_per_unit":25,
"total_period_usage": 8
}
]
},
{
"user_id":"222",
"description": "xxx",
"unit_summary":[
{
"total_period_cost":100,
"unit_id":"yyy",
"cost_per_unit":25,
"total_period_usage": 4
}
]
}]
df = pd.DataFrame(series)
print(df)
print(df.to_dict(orient='records'))
series
JSON object:import pandas as pd
series = [
{"user_id":"111", "unit_id":"xxx","cost_per_unit":50, "total_period_usage": 1},
{"user_id":"111", "unit_id":"xxx","cost_per_unit":50, "total_period_usage": 1},
{"user_id":"111", "unit_id":"yyy","cost_per_unit":25, "total_period_usage": 8},
{"user_id":"222", "unit_id":"yyy","cost_per_unit":25, "total_period_usage": 3},
{"user_id":"222", "unit_id":"yyy","cost_per_unit":25, "total_period_usage": 1}
]
df = pd.DataFrame(series)
sumc = (
df.groupby(['user_id', 'unit_id', 'cost_per_unit'], as_index=False)
.agg({'total_period_usage': 'sum'})
)
sumc['total_period_cost'] = sumc.total_period_usage * sumc.cost_per_unit
sumc = (
sumc.groupby(['user_id'])
.apply(lambda x: x[['total_period_cost', 'unit_id', 'cost_per_unit', 'total_period_usage']].to_dict('r'))
.reset_index()
)
sumc = sumc.rename(columns={0:'unit_summary'})
sumc['description'] = 'xxx'
print(sumc)
print(sumc.to_dict(orient='records'))
def myf(x):
return pd.DataFrame(x).loc[:,'total_period_cost'].sum()
# Sum all server sumbscriptions total_period_cost
sumc['total_period_cost'] = sumc['unit_summary'].apply(myf)
Upvotes: 1
Views: 965
Reputation: 75080
You can read each row in the unit_summary
column as a dataframe and sum the desired column:
Method1: apply
def myf(x):
return pd.DataFrame(x).loc[:,'total_period_cost'].sum()
df['total_cost'] = df['unit_summary'].apply(myf)
print(df)
Method2: similarly via list comprehension:
df['total_cost'] = [pd.DataFrame(i)['total_period_cost'].sum()
for i in df['unit_summary'].tolist()]
Method3: using explode
:
m = df['unit_summary'].explode()
df['total_cost'] = pd.DataFrame(m.tolist(),index=m.index)['total_period_cost'].sum(level=0)
user_id description unit_summary \
0 111 xxx [{'total_period_cost': 100, 'unit_id': 'xxx', ...
1 222 xxx [{'total_period_cost': 100, 'unit_id': 'yyy', ...
total_cost
0 300
1 100
In addition to the above, starting from your original dataframe, we can also do something like below to achieve the desired output, however this wouldnot give you the series with the dicts ('unit_summary`):
(df.assign(total_cost=df['cost_per_unit']*df['total_period_usage'])
.groupby(['user_id'],as_index=False)['total_cost'].sum().assign(description='xxxx'))
user_id total_cost description
0 111 300 xxxx
1 222 100 xxxx
Upvotes: 3