Eran Moshe

Reputation: 3208

Pandas groupby and apply - getting a new DataFrame over the groupby variable

I'm trying to use pandas.DataFrame.groupby('x') in order to run calculations on the df grouped by x.

The problem arises when 'x' repeats more than once: apply will do the calculations as many times as 'x' repeats, though I only need the 'aggregated' values (it's not really an aggregation but more like processing).

Here's a toy example:

import pandas as pd

def simulate_complicated_func(df):
    # This function simulates complicated calculations
    returned_col_names = ['calc1', 'calc2', 'calc3']

    df['calc1'] = ''.join(df['var1'])
    df['calc2'] = df['var2'].mean()
    df['calc3'] = ''.join(df['var1']) + str(df['var2'].max())

    return df[['id'] + returned_col_names]

df = pd.DataFrame({'id':['id1', 'id1', 'id2', 'id3', 'id3', 'id3'],
                   'var1':['abc', 'cba', 'abc', 'cba', 'abc', 'cba'],
                   'var2':[9, 4, 7, 4, 1, 3]})

print(df)

    id var1  var2
0  id1  abc     9
1  id1  cba     4
2  id2  abc     7
3  id3  cba     4
4  id3  abc     1
5  id3  cba     3

res_df = df.groupby(['id']).apply(simulate_complicated_func).drop_duplicates()
print(res_df)

    id      calc1     calc2       calc3
0  id1     abccba  6.500000     abccba9
2  id2        abc  7.000000        abc7
3  id3  cbaabccba  2.666667  cbaabccba4

The output is exactly what I want, but it's not efficient. Is there a better way to do it using pandas?

Edit: Optimize how?

If we add a print statement to simulate_complicated_func():

def simulate_complicated_func(df):
    # This function simulates complicated calculations
    print("function called")
    # ...

We can see that the code will print it 6 times:

function called
function called
function called
function called
function called
function called

In reality, we only need to call this function 3 times (the number of groups created by groupby).
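For a more precise count than eyeballing the printed lines, the function can be wrapped in a small call counter (a minimal sketch; counted is a hypothetical helper, not part of the original post):

from functools import wraps

def counted(func):
    # Hypothetical helper: counts how many times func gets invoked
    @wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

simulate_complicated_func = counted(simulate_complicated_func)
df.groupby(['id']).apply(simulate_complicated_func).drop_duplicates()
print(simulate_complicated_func.calls)  # should match the number of printed lines above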

Upvotes: 3

Views: 120

Answers (1)

jezrael

Reputation: 862511

One idea is to return a Series from the custom function, so drop_duplicates is not necessary:

def simulate_complicated_func(df):
    # This function simulates complicated calculations
    returned_col_names = ['calc1', 'calc2', 'calc3']

    a = ''.join(df['var1'])
    b = df['var2'].mean()
    c = ''.join(df['var1']) + str(df['var2'].max())

    return pd.Series([a, b, c], index=returned_col_names)

res_df = df.groupby(['id']).apply(simulate_complicated_func).reset_index()
print(res_df)
    id      calc1     calc2       calc3
0  id1     abccba  6.500000     abccba9
1  id2        abc  7.000000        abc7
2  id3  cbaabccba  2.666667  cbaabccba4
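As a quick check against the edit in the question, the same print statement can be added to this Series-returning version (a sketch; depending on the pandas version, apply may evaluate the first group an extra time to pick a code path, so the count can be off by one):

def simulate_complicated_func(df):
    # Same logic as above, plus a print to observe how often apply calls it
    print("function called")
    a = ''.join(df['var1'])
    b = df['var2'].mean()
    c = ''.join(df['var1']) + str(df['var2'].max())
    return pd.Series([a, b, c], index=['calc1', 'calc2', 'calc3'])

df.groupby(['id']).apply(simulate_complicated_func)
# prints "function called" roughly once per group - about 3 lines instead of 6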

Another idea is to use DataFrameGroupBy.agg, but that is only possible for columns that can be processed with aggregation functions like join and mean. agg works on each column separately, so calc3 cannot be computed this way in an easy/efficient manner - a custom function is still necessary for it, and the outputs are joined together at the end:

def simulate_complicated_func(df):
    # This function simulates complicated calculations
    returned_col_names = ['calc3']
    c = ''.join(df['var1']) + str(df['var2'].max())
    return pd.Series([c], index=returned_col_names)

d = {'var1': ''.join, 'var2':'mean'}
cols = {'var1':'calc1','var2':'calc2'}
g = df.groupby(['id'])

df1 = g.agg(d).rename(columns=cols)
print(df1)
         calc1     calc2
id                      
id1     abccba  6.500000
id2        abc  7.000000
id3  cbaabccba  2.666667

df2 = df.groupby(['id']).apply(simulate_complicated_func)
print(df2)
          calc3
id             
id1     abccba9
id2        abc7
id3  cbaabccba4

df = pd.concat([df1, df2], axis=1).reset_index()
print(df)
    id      calc1     calc2       calc3
0  id1     abccba  6.500000     abccba9
1  id2        abc  7.000000        abc7
2  id3  cbaabccba  2.666667  cbaabccba4
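As a side note, on pandas 0.25 or newer the agg-plus-rename step can be collapsed using named aggregation (a sketch under that version assumption, not part of the original answer):

g = df.groupby('id')
# Named aggregation: new_column=(input_column, aggregation_function)
df1 = g.agg(calc1=('var1', ''.join), calc2=('var2', 'mean'))
df2 = g.apply(simulate_complicated_func)  # the calc3-only function from above
out = pd.concat([df1, df2], axis=1).reset_index()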

Upvotes: 2
