Reputation: 2506
I might have found a bug in pandas.groupby.agg. Try the following code. It looks like what is passed to the aggregate function fn() is a DataFrame that includes the key column. My understanding is that the agg function is applied to each column separately, so only one column should be passed at a time. Since the 'year' column is the groupby key, it should be removed from the grouped data.
import pandas as pd
import numpy as np

df = pd.DataFrame({'year': [2011, 2011, 2012, 2012, 2013],
                   '5-1': [1.2, 2.1, 2.1, 11., 13.]})

def fn(x):
    print(x)
    # return np.mean(x) will explode
    return 0

res = df.groupby('year').agg(fn)
print(res)
The above gives the output, which clearly tells me that x of fn(x) is passed as a DataFrame with two columns (year, 5-1).
   5-1  year
0  1.2  2011
1  2.1  2011
    5-1  year
2   2.1  2012
3  11.0  2012
   5-1  year
4   13  2013
      5-1
year
2011    0
2012    0
2013    0
Upvotes: 0
Views: 749
Reputation: 28946
To answer your question, if you absolutely want the function applied to a Series, use the {column: aggfunc} syntax in .agg().
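Roughly what I mean, as a minimal sketch assuming a reasonably recent pandas. The seen_series list is just a helper I made up here to record what fn receives; it isn't part of pandas.

```python
import pandas as pd

df = pd.DataFrame({"year": [2011, 2011, 2012, 2012, 2013],
                   "5-1": [1.2, 2.1, 2.1, 11.0, 13.0]})

seen_series = []  # hypothetical helper: True for each call that got a Series

def fn(x):
    seen_series.append(isinstance(x, pd.Series))
    return x.mean()

# Map the column name to the function; 'year' is only used for grouping.
res = df.groupby("year").agg({"5-1": fn})
print(res)
```

With the dict form, fn only ever sees the values of the '5-1' column for one group at a time, so something like x.mean() has nothing non-numeric to trip over.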
That said, your code seems to work fine (at least on the current master). The function isn't actually being applied to the year column.
A bit of explanation. For this I'm assuming that you are on an older version of pandas, and that this version had a bug that has since been patched. To reproduce the behavior I think you were getting, let's redefine fn:
In [32]: def fn(x):
             print("Printing x+1 : {}".format(x + 1))
             print("Printing x: {}".format(x))
             return 0
And let's redefine df['year']:
In [33]: df['year'] = ['a', 'a', 'b', 'b', 'c']
All these objects are defined in pandas/core/groupby.py.
The df.groupby('year') part returns a DataFrameGroupBy object, since df is a DataFrame. .agg() isn't actually defined on DataFrameGroupBy; it's defined on its parent class, NDFrameGroupBy.
Since this isn't a Cython function, things get handed off to NDFrameGroupBy._aggregate_generic(). That tries to execute the function on each whole group and, if that raises, falls back to a separate section of code:
try:
    for name, data in self:
        result[name] = self._try_cast(func(data, *args, **kwargs),
                                      data)
except Exception:
    return self._aggregate_item_by_item(func, *args, **kwargs)
If the try part succeeds, the function is applied to the entire object (which is why printing x shows both columns), and the results are presented nicely with the grouper on the index and the values in the columns.
If the try part fails, things are handed off to _aggregate_item_by_item, which excludes the grouping column.
This means that by changing your code from return np.mean(x) to return 0, you actually changed the path the code follows. Before, when you tried to take the mean, I think it failed and fell back to _aggregate_item_by_item. (That's why I had you redefine df['year'] and fn: that version will fail for sure.) But when you switched to return 0, the call succeeded, and so the code followed the try part.
This is all just a bit of guesswork, but I think that's what's happening.
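To make the two paths concrete, here is a toy model of that try/except dispatch. It is not pandas' actual code: aggregate_with_fallback, returns_zero and needs_numbers are names I made up for illustration, and it only mimics the logic described above.

```python
import pandas as pd

def aggregate_with_fallback(df, key, func):
    """Toy model of the dispatch; returns (path_taken, result)."""
    grouped = df.groupby(key)
    try:
        # First attempt: hand each whole group (key column included) to func.
        result = {name: func(data) for name, data in grouped}
        return "whole-frame", pd.Series(result)
    except Exception:
        # Fallback: apply func column by column, skipping the key column.
        value_cols = [c for c in df.columns if c != key]
        result = {
            col: {name: func(data[col]) for name, data in grouped}
            for col in value_cols
        }
        return "item-by-item", pd.DataFrame(result)

df = pd.DataFrame({"year": ["a", "a", "b", "b", "c"],
                   "5-1": [1.2, 2.1, 2.1, 11.0, 13.0]})

def returns_zero(x):
    # Never touches the data, so it cannot fail on the whole frame.
    return 0

def needs_numbers(x):
    # Adding 1 raises on a frame that still contains the string key column.
    return float((x + 1).sum())

path_a, out_a = aggregate_with_fallback(df, "year", returns_zero)
path_b, out_b = aggregate_with_fallback(df, "year", needs_numbers)
print(path_a)
print(path_b)
print(out_b)
```

The function that ignores its input goes down the whole-frame path, while the one that needs numeric data trips the except branch and only ever gets the '5-1' values, which matches the difference between return 0 and return np.mean(x) in the question.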
I'm actually working on the group by code right now, and this issue has come up (see here). I don't think the function should ever be applied to the grouping column, but it sometimes is (R does the same). Post there if you have an opinion on the matter.
Upvotes: 2
Reputation: 13259
If year weren't included in the aggregation, how would you know what group you were aggregating over?
Upvotes: 0