Reputation: 457
I'm not sure if this is something I'm doing wrong or don't understand, or if it potentially a bug in either pandas crosstab or the numpy size function.
I'm using a simple dataframe from the pandas examples here
>>> df = DataFrame({'A': ['one', 'one', 'two', 'three'] * 6, 'B': ['A', 'B', 'C'] * 8, 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4, 'D': np.random.randn(24), 'E': np.random.randn(24)})
Doing a simple crosstab with margins=True to get the totals works as expected:
>>> crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True)
C bar foo All
A B
one A 2 2 4
B 2 2 4
C 2 2 4
three A 2 0 2
B 0 2 2
C 2 0 2
two A 0 2 2
B 2 0 2
C 0 2 2
All 12 12 24
Using the np.size function directly gives the same result:
>>> crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True, aggfunc=[np.size])
C bar foo All
A B
one A 2 2 4
B 2 2 4
C 2 2 4
three A 2 0 2
B 0 2 2
C 2 0 2
two A 0 2 2
B 2 0 2
C 0 2 2
All 12 12 24
Pandas allows you to pass in multiple aggregation functions to get a count and the mean in one crosstab. However when I do this, the counts for size for both foo and bar are double what they were in the previous call, yet the overall total stays correct.
>>> crosstab(rows=[df['A'],df['B']], cols=[df['C']], margins=True, aggfunc=[np.size, np.mean], values=df['D'])
size mean
C bar foo All bar foo All
A B
one A 2 2 4 0.245998 0.076366 0.161182
B 2 2 4 -0.739757 0.137780 -0.300988
C 2 2 4 -1.555759 -1.446554 -1.501157
three A 2 NaN 2 1.216109 NaN 1.216109
B NaN 2 2 NaN 0.255482 0.255482
C 2 NaN 2 0.732448 NaN 0.732448
two A NaN 2 2 NaN -0.273747 -0.273747
B 2 NaN 2 -0.001649 NaN -0.001649
C NaN 2 2 NaN 0.685422 0.685422
All 24 24 24 -0.017102 -0.094208 -0.055655
Am I missing something here? Why does this behave differently in the two cases?
Upvotes: 4
Views: 1595
Reputation: 28946
Ok I've got some idea of what it's doing.
After digging through the source code in pandas/pandas/tools/pivot.py
, it comes to the statement
row_margin = data[cols + values].groupby(cols).agg(aggfunc)
here cols is df['C'] and values is df['D']. We group those two things by cols and then apply the aggregating function, which in this case is np.size. Each row looks like
In [158]: data[cols + values].groupby(cols).nth(0)
Out[158]:
__dummy__
C
bar -1.823026
foo 0.465117
when we call np.size()
on that, we of course get 2. Sum all those 2's up to get the margin and we end up with 24, twice what you may have expected if you just wanted the counts of the D
'
Maybe someone else can tell us whether this is expected. I'm still a bit confused by one part of the source code. I'll edit if I figure out anything more.
Upvotes: 2