Reputation: 23
I have been struggling with a custom aggregate function in pandas and have not been able to figure it out. Let's consider the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'value': np.arange(1, 5), 'weights': np.arange(1, 5)})
Now, if I want to calculate the average of the value column using agg in pandas, it would be:
df.agg({'value': 'mean'})
which results in a scalar value of 2.5, as shown in the following:
However, if I define the following custom mean function:
def my_mean(vec):
    return np.mean(vec)
and use it in the following code:
df.agg({'value': my_mean})
I would get the following result:
So the question is: what should I do to get the same result as the default mean aggregate function? One more thing to note: if I use the mean method inside the custom function (shown below), it works just fine. However, I would like to know how to use the np.mean function in my custom function. Any help would be much appreciated!
def my_mean2(vec):
    return vec.mean()
Upvotes: 2
Views: 262
Reputation: 4251
When you pass a callable as the aggregate function and that callable is not one of the predefined callables like np.mean, np.sum, etc., pandas treats it as a transform and acts like df.apply().
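As a quick check of the point above, here is a minimal sketch using the data frame from the question. It only compares the string alias 'mean' with np.mean passed directly; the variable names by_name and by_numpy are made up for illustration, and depending on the pandas version the second call may emit a FutureWarning.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(1, 5), 'weights': np.arange(1, 5)})

# A string alias is dispatched to the built-in reduction.
by_name = df.agg({'value': 'mean'})

# Passing np.mean itself also reduces the whole column
# (some pandas versions warn about this call).
by_numpy = df.agg({'value': np.mean})

print(by_name)
print(by_numpy)
```

Both calls should give a one-element Series indexed by 'value', not one value per row.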
The way around it is to let pandas know that your callable expects a vector of values. A crude way to do it is something like:
def my_mean(vals):
    print(type(vals))
    try:
        # Scalars have no .shape, so element-wise calls fail here.
        vals.shape
    except AttributeError:
        raise TypeError()
    return np.mean(vals)
>>> df.agg({'value': my_mean})
<class 'int'>
<class 'pandas.core.series.Series'>
value 2.5
dtype: float64
You see, pandas first tries to call the function on each row (df.apply), but my_mean raises a TypeError, so on the second attempt it passes the whole column as a Series object. Comment out the try...except part and you'll see that my_mean is called on each row with an int argument.
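If you need this for several functions, the same trick can be packaged as a decorator. This is only a sketch: series_only is a hypothetical helper name, not part of pandas, and it relies on the fallback behaviour described above. On pandas versions that already pass the whole column to the callable, the guard should simply never trigger.

```python
import functools

import numpy as np
import pandas as pd


def series_only(func):
    """Hypothetical helper: make func refuse anything that is not a Series."""
    @functools.wraps(func)
    def wrapper(vec, *args, **kwargs):
        if not isinstance(vec, pd.Series):
            # Element-wise call: raise so pandas retries with the whole column.
            raise TypeError("expected a pandas Series")
        return func(vec, *args, **kwargs)
    return wrapper


@series_only
def my_mean(vec):
    return np.mean(vec)


df = pd.DataFrame({'value': np.arange(1, 5), 'weights': np.arange(1, 5)})
result = df.agg({'value': my_mean})
print(result)
```

The isinstance check replaces the bare vals.shape probe, so the intent (accept only a whole column) is explicit.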
More on the first part:
my_mean1 = np.mean
my_mean2 = lambda *args, **kwargs: np.mean(*args, **kwargs)
df.agg({'value': my_mean1})
df.agg({'value': my_mean2})
Although my_mean2 and np.mean are essentially the same, my_mean2 is np.mean evaluates to False, so it goes down the df.apply route, while my_mean1 works as expected.
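The identity difference can be seen without pandas at all. A minimal sketch, where same_object, wrapper_is_same and same_value are names made up for illustration:

```python
import numpy as np

my_mean1 = np.mean
my_mean2 = lambda *args, **kwargs: np.mean(*args, **kwargs)

# my_mean1 is just another name for the very same function object.
same_object = my_mean1 is np.mean

# my_mean2 is a new function object that merely forwards to np.mean.
wrapper_is_same = my_mean2 is np.mean

# Both still compute the same number when called on the same data.
same_value = my_mean1([1, 2, 3, 4]) == my_mean2([1, 2, 3, 4])

print(same_object, wrapper_is_same, same_value)
```

So any lookup that depends on the callable being the np.mean object itself will miss the lambda, even though the two behave identically when called.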
Upvotes: 2