Reputation: 3208
I'm trying to use pandas.DataFrame.groupby('x') in order to run calculations on the df grouped by x. The problem arises when 'x' repeats more than once: the apply function will do the calculations as many times as 'x' repeats, though I only need the 'aggregated' values (it's not really an aggregation, more like processing).
Here's a toy example:
import pandas as pd

def simulate_complicated_func(df):
    # This function simulates complicated calculations
    returned_col_names = ['calc1', 'calc2', 'calc3']
    df['calc1'] = ''.join(df['var1'])
    df['calc2'] = df['var2'].mean()
    df['calc3'] = ''.join(df['var1']) + str(df['var2'].max())
    return df[['id'] + returned_col_names]
df = pd.DataFrame({'id': ['id1', 'id1', 'id2', 'id3', 'id3', 'id3'],
                   'var1': ['abc', 'cba', 'abc', 'cba', 'abc', 'cba'],
                   'var2': [9, 4, 7, 4, 1, 3]})
print(df)
id var1 var2
0 id1 abc 9
1 id1 cba 4
2 id2 abc 7
3 id3 cba 4
4 id3 abc 1
5 id3 cba 3
res_df = df.groupby(['id']).apply(simulate_complicated_func).drop_duplicates()
print(res_df)
id calc1 calc2 calc3
0 id1 abccba 6.500000 abccba9
2 id2 abc 7.000000 abc7
3 id3 cbaabccba 2.666667 cbaabccba4
The output is exactly what I want, but it's not efficient. Is there a better way to do this using pandas?
If we add a print statement to simulate_complicated_func():
def simulate_complicated_func(df):
    # This function simulates complicated calculations
    print("function called")
    # ...
We can see that the code will print it 6 times:
function called
function called
function called
function called
function called
function called
In reality, we only need to call this function 3 times (the number of groups created by groupby).
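To be clear about what I'm after, here is a rough sketch (using the toy simulate_complicated_func above) that loops over the groups manually, so the function runs exactly once per group:
parts = [simulate_complicated_func(g.copy()).head(1)  # one call per group
         for _, g in df.groupby('id')]  # .copy() because the function mutates its input
res_df = pd.concat(parts, ignore_index=True)
print(res_df)
This gives the same three-row result without drop_duplicates, but I'm hoping there is a more idiomatic pandas way.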
Upvotes: 3
Views: 120
Reputation: 862511
One idea is to return a Series from the custom function, so that apply produces one row per group and drop_duplicates is no longer necessary:
def simulate_complicated_func(df):
    # This function simulates complicated calculations
    returned_col_names = ['calc1', 'calc2', 'calc3']
    a = ''.join(df['var1'])
    b = df['var2'].mean()
    c = ''.join(df['var1']) + str(df['var2'].max())
    return pd.Series([a, b, c], index=returned_col_names)
res_df = df.groupby(['id']).apply(simulate_complicated_func).reset_index()
print(res_df)
id calc1 calc2 calc3
0 id1 abccba 6.500000 abccba9
1 id2 abc 7.000000 abc7
2 id3 cbaabccba 2.666667 cbaabccba4
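Note that, depending on the pandas version, GroupBy.apply may call the function twice on the first group to decide between a fast and a slow code path, so a print counter can show one extra call; the actual work is still done once per group.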
Another idea is to use DataFrameGroupBy.agg, but this is only possible for columns that can be processed with aggregation functions like join and mean. Because agg works on each column separately, calc3 cannot be computed easily or efficiently this way; it still needs a custom function, and the two outputs are joined together at the end:
def simulate_complicated_func(df):
    # This function simulates complicated calculations
    returned_col_names = ['calc3']
    c = ''.join(df['var1']) + str(df['var2'].max())
    return pd.Series([c], index=returned_col_names)
d = {'var1': ''.join, 'var2': 'mean'}
cols = {'var1': 'calc1', 'var2': 'calc2'}
g = df.groupby(['id'])
df1 = g.agg(d).rename(columns=cols)
print(df1)
calc1 calc2
id
id1 abccba 6.500000
id2 abc 7.000000
id3 cbaabccba 2.666667
df2 = df.groupby(['id']).apply(simulate_complicated_func)
print(df2)
calc3
id
id1 abccba9
id2 abc7
id3 cbaabccba4
df = pd.concat([df1, df2], axis=1).reset_index()
print(df)
id calc1 calc2 calc3
0 id1 abccba 6.500000 abccba9
1 id2 abc 7.000000 abc7
2 id3 cbaabccba 2.666667 cbaabccba4
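Side note (not part of the original answer): on pandas 0.25+, the agg + rename step for calc1/calc2 can be written in one call with named aggregation; a sketch, assuming the original df from the question (before it is overwritten above):
# named aggregation builds calc1/calc2 directly,
# replacing the separate agg(d) + rename(columns=cols) step
df1 = df.groupby('id').agg(calc1=('var1', ''.join),
                           calc2=('var2', 'mean'))
print(df1)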
Upvotes: 2