RDJ
RDJ

Reputation: 4122

How can a pandas groupby .sum return wrong values?

When applying a groupby to a DataFrame the resultant grouped values do not sum to the same figures as when taking the column sums of the original DataFrame. How is this possible? I can't show my full data as it's sensitive and, more annoyingly, I can't seem to recreate the problem. The example below talks it through though.

Say I have this dataframe, df2:

    A      B    C   D
0   One    1    3   2
1   Two    NaN  6   4
2   Three  3    NaN 8
3   One    4    17  NaN
4   Two    7    NaN 2
5   Three  12   8   15

Summing it returns this as expected:

`df2.sum()`
    A    OneTwoThreeOneTwoThree
    B                        27
    C                        34
    D                        31
    dtype: object

Then using this groupby:

df3 = df2.groupby(['A'])[['B', 'C', 'D']].sum()

Returns this:

        B   C   D
A           
One     5   20  2
Three   15  8   23
Two     7   6   6

And summing that returns the same values as summing the original columns:

df3.sum()
B    27
C    34
D    31
dtype: float64

In my actual data, however, the original values are:

13496 non-null float64
11421 non-null float64
10890 non-null float64
10714 non-null float64

Yet after the same groupby as above using .sum(), the grouped rows sum to:

13021
11071
10568
10408

Is there some pandas caveat or gotcha I'm missing here? How are values being lost? This is obviously less of a direct programming question that can be answered, and more of a pandas 'internal workings' question. I don't see anything in the pandas documentation about this, nor are there any SO questions about it.

Upvotes: 4

Views: 11132

Answers (2)

ptrj
ptrj

Reputation: 5212

It can be explained if A has nan values.

>>> df = pd.DataFrame(np.random.randint(0,10,(5,3)))
>>> df['A'] = np.random.randint(0,2,5)
>>> df['A'][0] = np.NAN
>>> df
Out[52]: 
   0  1  2   A
0  2  1  5 NaN
1  6  9  7   0
2  5  2  6   1
3  8  4  7   0
4  2  3  4   1
>>> df.sum()
Out[53]: 
0    23
1    19
2    29
A     2
dtype: float64
>>> np.sum(df.groupby(by = 'A').sum())
Out[54]: 
0    21
1    18
2    24
dtype: int64

Upvotes: 2

root
root

Reputation: 33793

Do you have any NaN values in column A? This can produce the behavior that you're describing, because NaN values get dropped when they're being grouped. Consider the DataFrame below:

     A    B     C      D
0    x  1.0   NaN  100.0
1    x  2.0  21.0  105.0
2    y  NaN  22.0  110.0
3  NaN  4.0  23.0  115.0
4    z  5.0  24.0  120.0
5    z  6.0  25.0    NaN 

Then df.sum() produces:

B     18.0
C    115.0
D    550.0
dtype: float64

But df.groupby('A')['B', 'C', 'D'].sum().sum() produces:

B     14.0
C     92.0
D    435.0
dtype: float64 

Upvotes: 5

Related Questions