Reputation: 4122
When applying a groupby
to a DataFrame the resultant grouped values do not sum to the same figures as when taking the column sums of the original DataFrame. How is this possible? I can't show my full data as it's sensitive and, more annoyingly, I can't seem to recreate the problem. The example below talks it through though.
Say I have this dataframe, df2
:
A B C D
0 One 1 3 2
1 Two NaN 6 4
2 Three 3 NaN 8
3 One 4 17 NaN
4 Two 7 NaN 2
5 Three 12 8 15
Summing it returns this as expected:
`df2.sum()`
A OneTwoThreeOneTwoThree
B 27
C 34
D 31
dtype: object
Then using this groupby
:
df3 = df2.groupby(['A'])[['B', 'C', 'D']].sum()
Returns this:
B C D
A
One 5 20 2
Three 15 8 23
Two 7 6 6
And summing that returns the same values as summing the original columns:
df3.sum()
B 27
C 34
D 31
dtype: float64
In my actual data, however, the original values are:
13496 non-null float64
11421 non-null float64
10890 non-null float64
10714 non-null float64
Yet after the same groupby
as above using .sum()
, the grouped rows sum to:
13021
11071
10568
10408
Is there some pandas caveat or gotcha I'm missing here? How are values being lost? This is obviously less of a direct programming question that can be answered, and more of a pandas 'internal workings' question. I don't see anything in the pandas documentation about this, nor are there any SO questions about it.
Upvotes: 4
Views: 11132
Reputation: 5212
It can be explained if A
has nan values.
>>> df = pd.DataFrame(np.random.randint(0,10,(5,3)))
>>> df['A'] = np.random.randint(0,2,5)
>>> df['A'][0] = np.NAN
>>> df
Out[52]:
0 1 2 A
0 2 1 5 NaN
1 6 9 7 0
2 5 2 6 1
3 8 4 7 0
4 2 3 4 1
>>> df.sum()
Out[53]:
0 23
1 19
2 29
A 2
dtype: float64
>>> np.sum(df.groupby(by = 'A').sum())
Out[54]:
0 21
1 18
2 24
dtype: int64
Upvotes: 2
Reputation: 33793
Do you have any NaN
values in column A? This can produce the behavior that you're describing, because NaN
values get dropped when they're being grouped. Consider the DataFrame below:
A B C D
0 x 1.0 NaN 100.0
1 x 2.0 21.0 105.0
2 y NaN 22.0 110.0
3 NaN 4.0 23.0 115.0
4 z 5.0 24.0 120.0
5 z 6.0 25.0 NaN
Then df.sum()
produces:
B 18.0
C 115.0
D 550.0
dtype: float64
But df.groupby('A')['B', 'C', 'D'].sum().sum()
produces:
B 14.0
C 92.0
D 435.0
dtype: float64
Upvotes: 5