Codutie

Reputation: 1117

How to group by a column and summarise with a custom function in Python

Let df be our test pandas DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

What I want to do now is group by, say, column A, something like:

df.groupby(['A'])['C'].sum()

That works fine. Now, instead of using sum(), I want to apply my own function to summarise the data in an efficient way.

The equivalent in R would be:

require(plyr); require(dplyr)

df = data.frame(A = c('foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'),
                B = c('one', 'one', 'two', 'three','two', 'two', 'one', 'three'),
                C = rnorm(8),
                D = rnorm(8))

with for example this function called myfun:

myfun <- function(x){sum(x**2)}

then:

df %>% 
   group_by(A) %>% 
   summarise(result = myfun(C))

I hope the question was clear enough. Many thanks!

Upvotes: 2

Views: 995

Answers (1)

miradulo

Reputation: 29740

You could either use agg and place your custom function in a lambda, e.g.

>>> df.groupby('A').C.agg(lambda x: x.pow(2).sum())
A
bar    3.787664
foo    2.448404
Name: C, dtype: float64

Or you could define it separately and pass it to agg.

def sum2(x):
    return x.pow(2).sum()


>>> df.groupby('A').C.agg(sum2)
A
bar    3.787664
foo    2.448404
Name: C, dtype: float64
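
Since the question asks for efficiency, here is a minimal sketch of two variations. It uses small fixed data (my assumption, not the random values from the question) so the results are reproducible. The first squares the column up front so the grouped step can use the built-in sum; the second uses named aggregation, which needs pandas 0.25 or later, to mimic the dplyr `summarise(result = myfun(C))` call. The names `via_agg`, `via_builtin` and `named` are purely illustrative.

```python
import pandas as pd

# Hypothetical fixed data so the numbers are reproducible
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
})


def sum2(x):
    """Sum of squares of a Series."""
    return x.pow(2).sum()


# Custom function called once per group
via_agg = df.groupby('A')['C'].agg(sum2)

# Same numbers: square the whole column first, then use the built-in grouped sum
via_builtin = df['C'].pow(2).groupby(df['A']).sum()

# Named aggregation (pandas >= 0.25): output column is called 'result'
named = df.groupby('A').agg(result=('C', sum2))

print(via_agg)
print(via_builtin)
print(named)
```

On large frames the square-then-sum form tends to be faster because it avoids a Python-level function call per group, though that is worth timing on your own data.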

Note also that agg accepts several kinds of argument for the function, so it is fairly flexible. According to the docs, the function argument used for aggregating groups can currently be a:

  • string cythonized function name
  • function
  • list of functions
  • dict of columns -> functions
  • nested dict of names -> dicts of functions (deprecated in pandas 0.20 and removed in 1.0)
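
As a rough illustration of the first four forms in the list above, here is a sketch on small fixed data (again my own assumption, chosen so the results are easy to check by hand). The nested-dict form is left out because later pandas versions no longer accept it. The variable names are hypothetical.

```python
import pandas as pd

# Hypothetical fixed data for illustration
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 40.0],
})


def sum2(x):
    """Sum of squares of a Series."""
    return x.pow(2).sum()


grouped = df.groupby('A')

# String naming a built-in aggregation
by_name = grouped['C'].agg('sum')

# Plain Python function
by_func = grouped['C'].agg(sum2)

# List of functions: returns a DataFrame with one column per function
by_list = grouped['C'].agg(['sum', sum2])

# Dict of columns -> functions: a different aggregation per column
by_dict = grouped.agg({'C': sum2, 'D': 'mean'})

print(by_list)
print(by_dict)
```

With a list, the result columns are named after the functions (here `sum` and `sum2`), so giving the custom function a descriptive name is worthwhile.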

Upvotes: 3
