Mike

Reputation: 7203

Why does mean() have different behavior on empty DataFrames?

If I have an empty DataFrame in pandas like this:

>>> import pandas
>>> df = pandas.DataFrame(columns=['a','b','c'])
>>> df
Empty DataFrame
Columns: [a, b, c]
Index: []

and I aggregate on groups, the output will usually be an empty DataFrame:

>>> df.groupby('a', as_index=False).sum()
Empty DataFrame
Columns: [a, b, c]
Index: []

I say "usually" because this is not always the case. It works this way for min(), max(), sum(), count(), and quantile(), but not for mean(), which raises an exception:

>>> df.groupby('a', as_index=False).mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 666, in mean
    return self._cython_agg_general('mean')
  File "/usr/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 2358, in _cython_agg_general
    new_items, new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 2408, in _cython_agg_blocks
    raise DataError('No numeric types to aggregate')
pandas.core.groupby.DataError: No numeric types to aggregate

Why is the behavior different for this one aggregate function?

I am using pandas 0.14.1 on Python 2.7.

Upvotes: 2

Views: 815

Answers (2)

EdChum

Reputation: 394031

This exception is raised by the genuine groupby functions (http://pandas.pydata.org/pandas-docs/stable/api.html#id35). When you call sum, you are calling the Series or DataFrame version, which has no such restriction.

So in fact mean, median, sem, std, var and ohlc will all raise an exception.

Note also that if you had non-numerical data, the exception would be raised.
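The "No numeric types" message is easier to follow once you look at the dtypes. A DataFrame built from column names alone holds object columns, so there is nothing numeric to aggregate. The sketch below shows this, together with a hypothetical alternative that gives each empty column an explicit numeric dtype. It is written against a recent pandas release, not the 0.14.1 from the question, so older versions may behave differently.

```python
import pandas as pd

# Built from column names only: every column falls back to the object dtype.
untyped = pd.DataFrame(columns=['a', 'b', 'c'])
print(untyped.dtypes)

# Assumed alternative (not from the question): declare numeric dtypes up front.
typed = pd.DataFrame({
    'a': pd.Series(dtype='int64'),
    'b': pd.Series(dtype='float64'),
    'c': pd.Series(dtype='float64'),
})
print(typed.dtypes)

# With numeric columns available, mean() has something to aggregate,
# even though there are no rows.
result = typed.groupby('a', as_index=False).mean()
print(len(result))
```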

Compare what happens when you call apply with mean (assuming mean here is numpy.mean, for example in an IPython session started with --pylab):

In [18]: df.groupby('a', as_index=False).apply(mean)
Out[18]:
Empty DataFrame
Columns: []
Index: []

Here no exception is raised, as the Series or DataFrame version is being applied.

Upvotes: 1

sedavidw

Reputation: 11691

I'm not exactly sure, but I would hypothesize it's because mean() divides by the number of elements in the DataFrame, which in this case is 0, and that would cause a divide-by-zero error. I would just catch the error that is thrown.
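Instead of catching the exception, you could also skip the aggregation when there are no rows. This is a different technique from the try/except suggested above, and it avoids depending on which exception class a given pandas version raises. A minimal sketch, where the helper name group_mean_or_empty is made up for illustration:

```python
import pandas as pd

def group_mean_or_empty(df, key):
    """Per-group means, or an empty frame with the same columns if df has no rows."""
    if df.empty:
        # Nothing to aggregate: return a zero-row copy that keeps the columns.
        return df.iloc[0:0].copy()
    return df.groupby(key, as_index=False).mean()

# Empty input: the groupby call is never reached.
empty = pd.DataFrame(columns=['a', 'b', 'c'])
print(group_mean_or_empty(empty, 'a'))

# Non-empty input: ordinary grouped means.
filled = pd.DataFrame({'a': [1, 1, 2],
                       'b': [1.0, 3.0, 5.0],
                       'c': [2.0, 4.0, 6.0]})
print(group_mean_or_empty(filled, 'a'))
```

The trade-off is that the guard only covers the zero-row case. A non-empty frame whose columns are all non-numeric would still need its own handling.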

Upvotes: 1
