brandon
brandon

Reputation: 457

Pandas crosstab double counting when using two aggregate functions?

I'm not sure if this is something I'm doing wrong or don't understand, or if it potentially a bug in either pandas crosstab or the numpy size function.

I'm using a simple dataframe from the pandas examples here

>>> df = DataFrame({'A': ['one', 'one', 'two', 'three'] * 6, 'B': ['A', 'B', 'C'] * 8, 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4, 'D': np.random.randn(24), 'E': np.random.randn(24)})

Doing a simple crosstab with margins=True to get the totals works as expected:

>>> crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True)
C        bar  foo  All
A     B               
one   A    2    2    4
      B    2    2    4
      C    2    2    4
three A    2    0    2
      B    0    2    2
      C    2    0    2
two   A    0    2    2
      B    2    0    2
      C    0    2    2
All       12   12   24

Using the np.size function directly gives the same result:

>>> crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True, aggfunc=[np.size])
C        bar  foo  All
A     B               
one   A    2    2    4
      B    2    2    4
      C    2    2    4
three A    2    0    2
      B    0    2    2
      C    2    0    2
two   A    0    2    2
      B    2    0    2
      C    0    2    2
All       12   12   24

Pandas allows you to pass in multiple aggregation functions to get a count and the mean in one crosstab. However when I do this, the counts for size for both foo and bar are double what they were in the previous call, yet the overall total stays correct.

>>> crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True, aggfunc=[np.size, np.mean], values=df['D'])
         size                mean                    
C         bar  foo  All       bar       foo       All
A     B                                              
one   A     2    2    4  0.245998  0.076366  0.161182
      B     2    2    4 -0.739757  0.137780 -0.300988
      C     2    2    4 -1.555759 -1.446554 -1.501157
three A     2  NaN    2  1.216109       NaN  1.216109
      B   NaN    2    2       NaN  0.255482  0.255482
      C     2  NaN    2  0.732448       NaN  0.732448
two   A   NaN    2    2       NaN -0.273747 -0.273747
      B     2  NaN    2 -0.001649       NaN -0.001649
      C   NaN    2    2       NaN  0.685422  0.685422
All        24   24   24 -0.017102 -0.094208 -0.055655

Am I missing something here? Why does this behave differently in the two cases?

Upvotes: 4

Views: 1595

Answers (1)

TomAugspurger
TomAugspurger

Reputation: 28946

Ok I've got some idea of what it's doing.

After digging through the source code in pandas/pandas/tools/pivot.py, it comes to the statement

row_margin = data[cols + values].groupby(cols).agg(aggfunc)

here cols is df['C'] and values is df['D']. We group those two things by cols and then apply the aggregating function, which in this case is np.size. Each row looks like

In [158]: data[cols + values].groupby(cols).nth(0)
Out[158]: 
     __dummy__
C             
bar  -1.823026
foo   0.465117

when we call np.size() on that, we of course get 2. Sum all those 2's up to get the margin and we end up with 24, twice what you may have expected if you just wanted the counts of the D'

Maybe someone else can tell us whether this is expected. I'm still a bit confused by one part of the source code. I'll edit if I figure out anything more.

Upvotes: 2

Related Questions