Reputation: 4609
I have a grouped pandas dataframe. I want to aggregate multiple columns. For each column, there are multiple aggregate functions. This is pretty straightforward. The tricky part is that in each aggregate function, I want to access data in another column.
How would I go about doing this efficiently? Here's the code I already have:
import pandas
data = [
{
'id': 1,
'A': 1,
'B': 1,
'C': 1,
'D': 1,
'E': 1,
'F': 1,
},
{
'id': 1,
'A': 2,
'B': 2,
'C': 2,
'D': 2,
'E': 2,
'F': 2,
},
{
'id': 2,
'A': 3,
'B': 3,
'C': 3,
'D': 3,
'E': 3,
'F': 3,
},
{
'id': 2,
'A': 4,
'B': 4,
'C': 4,
'D': 4,
'E': 4,
'F': 4,
},
]
df = pandas.DataFrame.from_records(data)
def get_column(column, column_name):
return df.iloc[column.index][column_name]
def agg_sum_a_b(column_a):
return column_a.sum() + get_column(column_a, 'B').sum()
def agg_sum_a_b_divide_c(column_a):
return (column_a.sum() + get_column(column_a, 'B').sum()) / get_column(column_a, 'C').sum()
def agg_sum_d_divide_sum_e_f(column_d):
return column_d.sum() / (get_column(column_d, 'E').sum() + get_column(column_d, 'F').sum())
def multiply_then_sum(column_e):
return (column_e * get_column(column_e, 'F')).sum()
df_grouped = df.groupby('id')
df_agg = df_grouped.agg({
'A': [agg_sum_a_b, agg_sum_a_b_divide_c, 'sum'],
'D': [agg_sum_d_divide_sum_e_f, 'sum'],
'E': [multiply_then_sum]
})
This code produces this dataframe:
A D E
agg_sum_a_b agg_sum_a_b_divide_c sum agg_sum_d_divide_sum_e_f sum multiply_then_sum
id
1 6 2 3 0.5 3 5
2 14 2 7 0.5 7 25
Am I doing this correctly? Is there a better way of doing this? I find the way I access data in another column within the aggregate function a little awkward.
The real data and code I'm using has about 20 columns and around 40 aggregate functions. There could potentially be hundreds of groups as well with each group having hundreds of rows.
When I do this using the real data and aggregate functions, it can take several minutes which is too slow for my purposes. Any way to make this more efficient?
Edit: I'm using Python 3.6 and pandas 0.23.0 btw. Thanks!
Edit 2: Added an example where I don't call sum()
on the columns.
Upvotes: 2
Views: 90
Reputation: 29635
First I think you need more apply
than agg
to access different columns at once. Here is an idea how to change a bit what you want to do. Let's first create a function regrouping the operation you want to do and return them as a list of results:
def operations_to_perfom (df_g):
df_g_sum = df_g.sum() #can do the same with mean, min, max ...
# return all the operation you want
return [ df_g_sum['A'] + df_g_sum['B'],
(df_g_sum['A'] + df_g_sum['B'])/df_g_sum['C'],
df_g_sum['A'],
float(df_g_sum['D'])/(df_g_sum['E']+df_g_sum['F']),
(df_g['E']*df_g['F']).sum() ]
#use apply to create a serie with id as index and a list of agg
df_values = df.groupby('id').apply(operations_to_perfom)
# now create the result dataframe from df_values with tolist() and index
df_agg = pd.DataFrame( df_values.tolist(), index=df_values.index,
columns=pd.MultiIndex.from_arrays([['A']*3+['D']+['E'],
['agg_sum_a_b', 'agg_sum_a_b_div_c' ,'sum', 'agg_sum_d_div_sum_e_f', 'e_mult_f']]))
and df_agg
looks like:
A D E
agg_sum_a_b agg_sum_a_b_div_c sum agg_sum_d_div_sum_e_f e_mult_f
id
1 6 2 3 0.5 5
2 14 2 7 0.5 25
Upvotes: 1