Ashkan

Reputation: 23

Custom Aggregate Function in Python

I have been struggling with a custom aggregate function in pandas and have not been able to figure it out. Consider the following data frame:

import numpy as np
import pandas as pd
df = pd.DataFrame({'value': np.arange(1, 5), 'weights':np.arange(1, 5)})

Now, if I want to calculate the average of the value column using agg in pandas, it would be:

df.agg({'value': 'mean'})

which results in a single value of 2.5.

However, if I define the following custom mean function:

def my_mean(vec):
    return np.mean(vec)

and use it in the following code:

df.agg({'value': my_mean})

I get a different result instead of the single value 2.5:

[image: output of df.agg with my_mean]

So the question is: what should I do to get the same result as the default mean aggregate function? One more thing to note: if I call the mean method inside the custom function (shown below), it works just fine. However, I would like to know how to use the np.mean function in my custom function. Any help would be much appreciated!

def my_mean2(vec):
    return vec.mean()

Upvotes: 2

Views: 262

Answers (1)

Mohammad Jafar Mashhadi

Reputation: 4251

When you pass a callable as the aggregate function and that callable is not one of the predefined callables such as np.mean, np.sum, etc., pandas treats it as a transform and acts like df.apply().
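To make the difference concrete, here is a minimal sketch (not pandas' internal code) comparing the two ways a function such as my_mean can be called: once per element, as apply does, and once on the whole column. Because np.mean accepts a plain scalar without complaint, the element-wise route raises no error and simply hands every value back unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(1, 5), 'weights': np.arange(1, 5)})

def my_mean(vec):
    return np.mean(vec)

# Element-wise route: the function sees one scalar at a time,
# so the "mean" of each scalar is the scalar itself.
elementwise = [float(my_mean(v)) for v in df['value'].tolist()]

# Whole-column route: the function sees the full Series and reduces it.
whole_column = float(my_mean(df['value']))

print(elementwise)   # [1.0, 2.0, 3.0, 4.0]
print(whole_column)  # 2.5
```

Only the second call is an aggregation in the usual sense, which is why the custom function needs some way to reject scalars.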

The way around it is to let pandas know that your callable expects a vector of values. A crude way to do it is something like:

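def my_mean(vals):
    print(type(vals))  # show what pandas passes in
    try:
        vals.shape  # a plain int has no .shape attribute
    except AttributeError:
        # reject scalars so that pandas retries with the whole column
        raise TypeError()

    return np.mean(vals)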

>>> df.agg({'value': my_mean})
<class 'int'>
<class 'pandas.core.series.Series'> 
value    2.5
dtype: float64

As you can see, pandas first tries to call the function on each row (as df.apply would), but my_mean raises a TypeError, so on the second attempt pandas passes the whole column as a Series object. Comment out the try...except part and you'll see that my_mean is called on each row with an int argument.
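A slightly more explicit variant of the same trick, as a sketch: instead of probing for .shape, check for a Series directly and raise TypeError otherwise. The name my_mean_strict is made up for this example. On pandas versions that try the element-wise call first, the TypeError triggers the fallback to the whole column; versions that pass the column straight away should give the same final result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(1, 5), 'weights': np.arange(1, 5)})

def my_mean_strict(vals):
    # Hypothetical helper: refuse anything that is not a whole column,
    # so a per-element call fails instead of silently succeeding.
    if not isinstance(vals, pd.Series):
        raise TypeError(f"expected a pandas Series, got {type(vals).__name__}")
    return np.mean(vals)

result = df.agg({'value': my_mean_strict})
print(result['value'])  # 2.5
```

An isinstance check also covers NumPy scalars, which do expose a .shape attribute and would slip past the attribute probe.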


More on the first part:

my_mean1 = np.mean
my_mean2 = lambda *args, **kwargs: np.mean(*args, **kwargs)

df.agg({'value': my_mean1})
df.agg({'value': my_mean2})

Although my_mean2 and np.mean are essentially the same, the expression my_mean2 is np.mean evaluates to False, so my_mean2 goes down the df.apply route, while my_mean1 works as expected.
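The identity check can be verified directly. The sketch below also mimics the idea with a made-up lookup table (KNOWN_REDUCTIONS is hypothetical and is not pandas' actual internal structure) to show why a wrapper with identical behaviour would still not be recognised: a function used as a dictionary key is matched by object identity, not by what it computes.

```python
import numpy as np

my_mean1 = np.mean
my_mean2 = lambda *args, **kwargs: np.mean(*args, **kwargs)

print(my_mean1 is np.mean)  # True: just another name for the same object
print(my_mean2 is np.mean)  # False: a new function object wrapping np.mean

# Hypothetical registry of "known" reductions, keyed by function object.
KNOWN_REDUCTIONS = {np.mean: 'mean', np.sum: 'sum'}

print(KNOWN_REDUCTIONS.get(my_mean1))  # mean
print(KNOWN_REDUCTIONS.get(my_mean2))  # None
```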

Upvotes: 2
