Bart

Reputation: 1087

Pandas groupby-apply: strange behavior with NaNs in the group column

I'm encountering some unexpected Pandas groupby-apply results, and I can't figure out the exact cause.

Below I have two dataframes that are identical except for the ordering of two values. df1 produces the results I expect, but df2 produces a completely different result.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'group_col': [0.0, np.nan, 0.0, 0.0], 'value_col': [2,2,2,2]})
df2 = pd.DataFrame({'group_col': [np.nan, 0.0, 0.0, 0.0], 'value_col': [2,2,2,2]})

df1:

   group_col  value_col
0        0.0          2
1        NaN          2
2        0.0          2
3        0.0          2

df2:

   group_col  value_col
0        NaN          2
1        0.0          2
2        0.0          2
3        0.0          2

When I group by group_col and take a value_counts of value_col per group, reindexing so that all possible values appear in the result, I get the following for df1:

df1.groupby('group_col').value_col.apply(lambda x: x.value_counts().reindex(index=[1,2,3]))

group_col   
0.0        1    NaN
           2    3.0
           3    NaN
Name: value_col, dtype: float64

It correctly finds one group and returns a MultiIndex Series with the value counts for each possible value. But when I run the same on df2, I get a completely different result:

0    NaN
1    NaN
2    3.0
3    NaN
Name: value_col, dtype: float64

Here the result has an index matching the original DataFrame instead of the MultiIndex I would expect. I thought it might have something to do with the group column starting with np.nan, but when I drop the last row I get the expected result again, so apparently the cause is something else.

df2.head(3).groupby('group_col').value_col.apply(lambda x: x.value_counts().reindex(index=[1,2,3]))

group_col   
0.0        1    NaN
           2    2.0
           3    NaN
Name: value_col, dtype: float64

What could be causing this?

Upvotes: 4

Views: 359

Answers (1)

Kosuke Sakai

Reputation: 2411

Let's begin by looking at some simple grouping calculations to understand how pandas handles them.

In the following case, the grouping keys are used as the index of the resulting Series object; the original index is dropped.

In [4]: df1.groupby('group_col')['value_col'] \
   ...: .apply(lambda x: {'sum': x.sum(), 'mean': x.mean()})
Out[4]: 
group_col
0.0        sum     6.0
           mean    2.0
Name: value_col, dtype: float64

In [5]: df2.groupby('group_col')['value_col'] \
   ...: .apply(lambda x: {'sum': x.sum(), 'mean': x.mean()})
Out[5]: 
group_col
0.0        sum     6.0
           mean    2.0
Name: value_col, dtype: float64

In the next case, the index of the original DataFrame is preserved, and the grouping keys do not appear in the resulting Series.

In [6]: df1.groupby('group_col')['value_col'].apply(lambda x: x / len(x))
Out[6]: 
0    0.666667
1         NaN
2    0.666667
3    0.666667
Name: value_col, dtype: float64

In [7]: df2.groupby('group_col')['value_col'].apply(lambda x: x / len(x))
Out[7]: 
0         NaN
1    0.666667
2    0.666667
3    0.666667
Name: value_col, dtype: float64

What makes pandas behave differently when it builds the index of the combined object?

The difference comes down to whether the applied function mutated the index. When the index of each per-group result is identical to the index of the original chunk, pandas chooses to reuse the original index. When the index differs from the original chunk, pandas prepends the group key to form a MultiIndex.
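One way to see this heuristic deterministically (a small sketch of my own, not from the question): return a Series whose index labels can never collide with the chunk's positional index, so apply always takes the "aggregation" path and adds the group key as the outer level.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'group_col': [np.nan, 0.0, 0.0, 0.0],
                    'value_col': [2, 2, 2, 2]})

# The returned index ['n', 'total'] never matches the chunk's integer
# index, so pandas always treats this as an aggregation and builds a
# MultiIndex of (group key, label):
out = df2.groupby('group_col')['value_col'].apply(
    lambda x: pd.Series({'n': len(x), 'total': x.sum()}))
```

Because the NaN row is dropped by groupby, the only group is 0.0, with `n` = 3 and `total` = 6, regardless of which row the NaN sits in.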

Now, going back to the question, notice that the index was changed for df1. For group key 0.0, the index of the original chunk was [0, 2, 3], whereas it is [1, 2, 3] after the aggregation. For df2, however, the original index was [1, 2, 3], and by coincidence it was not changed by the aggregation, so pandas reused it instead of building a MultiIndex.
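If you want per-group counts whose shape does not depend on which labels each chunk happens to carry, one option (my suggestion, not part of the original answer) is to avoid apply entirely and use the groupby's own value_counts, unstacking the counts into columns:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'group_col': [np.nan, 0.0, 0.0, 0.0],
                    'value_col': [2, 2, 2, 2]})

# value_counts on the SeriesGroupBy yields a (group_col, value_col)
# MultiIndex of counts; unstack turns the values into columns, and
# reindex adds the missing value columns with a count of 0:
counts = (df2.groupby('group_col')['value_col']
             .value_counts()
             .unstack(fill_value=0)
             .reindex(columns=[1, 2, 3], fill_value=0))
```

This sidesteps the index-reuse heuristic, so df1 and df2 give the same tidy result: one row per group, one column per possible value.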

Upvotes: 1
