How to group by a column and summarise with a custom function in Python

Question

Let df be our test DataFrame in pandas:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

What I want to do now is group by, say, column A, something like:

df.groupby(['A'])['C'].sum()

That works fine. Now, instead of using sum(), I want to apply my own function to summarise the data efficiently.

The equivalent in R would be:

require(plyr); require(dplyr)

df = data.frame(A = c('foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'),
                B = c('one', 'one', 'two', 'three','two', 'two', 'one', 'three'),
                C = rnorm(8),
                D = rnorm(8))

with, for example, this function called myfun:

myfun <- function(x){sum(x**2)}

then:

df %>% 
   group_by(A) %>% 
   summarise(result = myfun(C))

I hope the question was clear enough. Many thanks!

miradulo · Accepted Answer

You could either use agg and pass your custom function as a lambda, e.g.

>>> df.groupby('A').C.agg(lambda x: x.pow(2).sum())
A
bar    3.787664
foo    2.448404
Name: C, dtype: float64

Or you could define it separately and pass it to agg.

def sum2(x):
    return x.pow(2).sum()


>>> df.groupby('A').C.agg(sum2)
A
bar    3.787664
foo    2.448404
Name: C, dtype: float64
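To get closer to the dplyr call in the question, where the output column gets its own name (result), named aggregation may be a good fit. The following is a minimal sketch, assuming pandas 0.25 or later; the small fixed DataFrame is made up for illustration and is not the random data from the question, so the numbers differ from the output above.

```python
import pandas as pd

# Small deterministic frame (illustrative values, not the question's random data).
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

def myfun(x):
    # Sum of squares of the values in one group (x is a Series).
    return (x ** 2).sum()

# Named aggregation: the output column "result" is computed from column "C" with myfun.
out = df.groupby("A").agg(result=("C", myfun)).reset_index()
print(out)
```

Here reset_index() turns the group key A back into an ordinary column, which roughly matches the shape of the summarise() result in R.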

Note also that agg accepts several kinds of argument, so it is fairly flexible. According to the docs, the function argument used for aggregating groups can currently be a:

string cythonized function name
function
list of functions
dict of columns -> functions
nested dict of names -> dicts of functions
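The forms listed above can be sketched side by side as follows. This is an illustration on a small made-up DataFrame, not output from the question's data. The nested-dict form is left out because, as far as I know, recent pandas releases no longer accept it.

```python
import pandas as pd

# Illustrative data; column D is only there to show the dict form.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

def sum2(x):
    # Sum of squares of one group's values.
    return x.pow(2).sum()

grouped = df.groupby("A")

by_name = grouped["C"].agg("sum")                # string naming a built-in aggregation
by_func = grouped["C"].agg(sum2)                 # plain Python function
by_list = grouped["C"].agg(["sum", sum2])        # list: one output column per function
by_dict = grouped.agg({"C": sum2, "D": "mean"})  # dict mapping columns to functions

print(by_list)
print(by_dict)
```

With the list form, the output columns are named after the functions (here sum and sum2), which is one reason to define a named function rather than a lambda.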
