Possible Bug in pandas.groupby.agg?

Question

I might have found a bug in pandas.groupby.agg. Try the following code. It looks like what is passed to the aggregate function fn() is a DataFrame that includes the key column. My understanding is that agg applies the function to each column separately, so only one column should be passed at a time. Since the 'year' column is the groupby key, it should be removed from the grouped results.

import pandas as pd
import numpy as np

df = pd.DataFrame({'year' : [2011,2011,2012,2012,2013], '5-1' : [1.2, 2.1,2.1,11., 13.]})

def fn(x):
    print(x)
    # return np.mean(x) will explode
    return 0


res = df.groupby('year').agg(fn)
print(res)

The above gives the following output, which clearly shows that the x passed to fn(x) is a DataFrame with two columns (year, 5-1).

   5-1  year
0  1.2  2011
1  2.1  2011
    5-1  year
2   2.1  2012
3  11.0  2012
   5-1  year
4   13  2013
      5-1
year     
2011    0
2012    0
2013    0

TomAugspurger · Accepted Answer

To answer your question, if you absolutely want the function applied to a Series, use the {column: aggfunc} syntax in .agg().
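As a minimal sketch of that dict syntax, reusing the df from the question: the seen_types list is a hypothetical helper added only to record what fn receives, and the exact behavior may differ on older pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'year': [2011, 2011, 2012, 2012, 2013],
                   '5-1': [1.2, 2.1, 2.1, 11.0, 13.0]})

seen_types = []  # hypothetical helper: records the type of each object passed to fn

def fn(x):
    # With the {column: aggfunc} form, x is one group's values for that column
    seen_types.append(type(x).__name__)
    return x.mean()

# Apply fn only to the '5-1' column of each 'year' group
res = df.groupby('year').agg({'5-1': fn})
print(res)
print(set(seen_types))
```

Selecting the column first, as in df.groupby('year')['5-1'].agg(fn), is another way to make sure the function only ever sees a single column.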

That said, your code seems to work fine (at least on the current master). The function isn't actually being applied to the year column.

A bit of explanation. For this I'm assuming that you are on an older version of pandas, and that this version had a bug that has since been patched. To reproduce the behavior I think you were getting, let's redefine fn:

In [32]: def fn(x):
    print("Printing x+1 : {}".format(x + 1))
    print("Printing x: {}".format(x))
    return 0

And let's redefine df['year']

In [33]: df['year'] = ['a', 'a', 'b', 'b', 'c']

All these objects are defined in pandas/core/groupby.py. The df.groupby('year') part returns a DataFrameGroupBy object, since df is a DataFrame. .agg() isn't actually defined on DataFrameGroupBy; it's defined on its parent class, NDFrameGroupBy.

Since this isn't a Cython function, things get handed off to NDFrameGroupBy._aggregate_generic(). That tries to execute the function on each group and, if it fails, falls back to a separate section of code:

    try:
        for name, data in self:
            result[name] = self._try_cast(func(data, *args, **kwargs),
                                          data)
    except Exception:
        return self._aggregate_item_by_item(func, *args, **kwargs)

If the try part succeeds, the function is applied to the entire group object (which is why printing x shows both columns), and the results are presented nicely with the grouper on the index and the values in the columns.

If the try part fails, things are handed off to _aggregate_item_by_item, which excludes the grouping column.
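The two paths can be mimicked with a toy model in plain Python. This is only an illustration of the try/fallback control flow, not pandas' actual implementation: each group is a plain dict of column name to list of values, and the function names aggregate_generic and aggregate_item_by_item here are hypothetical stand-ins.

```python
def aggregate_item_by_item(groups, key, func):
    # Fallback path: call func on each column separately, skipping the grouping column
    return {name: {col: func(values)
                   for col, values in frame.items() if col != key}
            for name, frame in groups.items()}

def aggregate_generic(groups, key, func):
    try:
        # First attempt: call func once per group on the whole group, key column included
        return {name: func(frame) for name, frame in groups.items()}
    except Exception:
        return aggregate_item_by_item(groups, key, func)

# Toy stand-in for the grouped frame, using the string 'year' values from above
groups = {
    'a': {'year': ['a', 'a'], '5-1': [1.2, 2.1]},
    'b': {'year': ['b', 'b'], '5-1': [2.1, 11.0]},
    'c': {'year': ['c'], '5-1': [13.0]},
}

def constant(x):
    return 0  # never fails, so the try path is used

def mean(x):
    return sum(x) / len(x)  # raises TypeError on a whole group (a dict), forcing the fallback

whole_group_result = aggregate_generic(groups, 'year', constant)
per_column_result = aggregate_generic(groups, 'year', mean)
print(whole_group_result)
print(per_column_result)
```

In this sketch, constant sees each whole group once, while mean ends up being applied only to the '5-1' values because the first attempt raises.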

This means that by changing your code from return np.mean(x) to return 0, you actually changed the path the code follows. Before, when you tried to take the mean, I think it failed and fell back to _aggregate_item_by_item (that's why I had you redefine df['year'] and fn: with string years, x + 1 will fail for sure). But when you switched to return 0, the call succeeded, and so the code followed the try part.

This is all just a bit of guesswork, but I think that's what's happening.

I'm actually working on the groupby code right now, and this issue has come up (see here). I don't think the function should ever be applied to the grouping column, but it sometimes is (R does the same). Post there if you have an opinion on the matter.
