Reputation: 1087
I'm encountering some unexpected Pandas groupby-apply results, and I can't figure out the exact cause.
Below I have two DataFrames that are identical except for the ordering of two values. df1 produces the results I expect, but df2 produces a completely different result.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'group_col': [0.0, np.nan, 0.0, 0.0], 'value_col': [2, 2, 2, 2]})
df2 = pd.DataFrame({'group_col': [np.nan, 0.0, 0.0, 0.0], 'value_col': [2, 2, 2, 2]})
df1:
   group_col  value_col
0        0.0          2
1        NaN          2
2        0.0          2
3        0.0          2
df2:
   group_col  value_col
0        NaN          2
1        0.0          2
2        0.0          2
3        0.0          2
When I group by group_col and do a value_counts of value_col per group, including a reindex to include all possible values in the result, I get the following for df1:
df1.groupby('group_col').value_col.apply(lambda x: x.value_counts().reindex(index=[1,2,3]))
group_col
0.0        1    NaN
           2    3.0
           3    NaN
Name: value_col, dtype: float64
It correctly finds one group and returns a MultiIndex Series with the value counts for each possible value. But when I run the same on df2, I get a completely different result:
0 NaN
1 NaN
2 3.0
3 NaN
Name: value_col, dtype: float64
Here the result has an index matching the original DataFrame instead of the MultiIndex I would expect. I thought it might have something to do with the group column starting with np.nan, but when I drop the last row I get the expected result again, so apparently the cause is something else.
df2.head(3).groupby('group_col').value_col.apply(lambda x: x.value_counts().reindex(index=[1,2,3]))
group_col
0.0        1    NaN
           2    2.0
           3    NaN
Name: value_col, dtype: float64
What could be causing this?
Upvotes: 4
Views: 359
Reputation: 2411
Let's begin by looking at some simple grouping calculations to understand how pandas combines the per-group results.
In the following case, the grouping keys are used as the index of the resulting Series object; the original index is dropped.
In [4]: df1.groupby('group_col')['value_col'] \
...: .apply(lambda x: {'sum': x.sum(), 'mean': x.mean()})
Out[4]:
group_col
0.0        sum     6.0
           mean    2.0
Name: value_col, dtype: float64
In [5]: df2.groupby('group_col')['value_col'] \
...: .apply(lambda x: {'sum': x.sum(), 'mean': x.mean()})
Out[5]:
group_col
0.0        sum     6.0
           mean    2.0
Name: value_col, dtype: float64
In the next case, the index of the original DataFrame is preserved, and the grouping keys are not contained in the resulting Series.
In [6]: df1.groupby('group_col')['value_col'].apply(lambda x: x / len(x))
Out[6]:
0 0.666667
1 NaN
2 0.666667
3 0.666667
Name: value_col, dtype: float64
In [7]: df2.groupby('group_col')['value_col'].apply(lambda x: x / len(x))
Out[7]:
0 NaN
1 0.666667
2 0.666667
3 0.666667
Name: value_col, dtype: float64
What makes pandas behave differently when it produces the index of the combined object?
Actually, this is based on whether the index was mutated by the aggregation or not. When the index of the result is the same as the index of the original chunk, pandas chooses to reuse the original index. On the other hand, when the index differs from the original, it uses the group keys in the index to form a MultiIndex.
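You can watch this inference happen by printing the index of each group chunk next to the index of what the function returns. Here is a small sketch (the show_indices helper is hypothetical, just for illustration):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'group_col': [0.0, np.nan, 0.0, 0.0], 'value_col': [2, 2, 2, 2]})
df2 = pd.DataFrame({'group_col': [np.nan, 0.0, 0.0, 0.0], 'value_col': [2, 2, 2, 2]})

def show_indices(x):
    # Compare the index of the incoming chunk with the index of the
    # value handed back to groupby-apply.
    out = x.value_counts().reindex(index=[1, 2, 3])
    print('chunk index:', list(x.index), '-> result index:', list(out.index))
    return out

df1.groupby('group_col').value_col.apply(show_indices)
# chunk index: [0, 2, 3] -> result index: [1, 2, 3]  (changed: group keys added)
df2.groupby('group_col').value_col.apply(show_indices)
# chunk index: [1, 2, 3] -> result index: [1, 2, 3]  (unchanged: index reused)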
Now, going back to the question, notice that the index was changed for df1. For group key 0.0, the index of the original chunk was [0, 2, 3], whereas it is [1, 2, 3] after the reindex. For df2, however, the original chunk index was already [1, 2, 3], and by coincidence it was not changed by the aggregation, so pandas reused it as the index of the combined result.
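If you want the MultiIndex result regardless of how the chunk indices happen to line up, one possible workaround (a sketch, not the only option) is to build the per-group counts yourself and combine them with pd.concat, which always places the group keys in the outer index level:
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'group_col': [np.nan, 0.0, 0.0, 0.0], 'value_col': [2, 2, 2, 2]})

# Compute the counts per group explicitly, then let pd.concat attach the
# group keys as the outer index level, so no index inference is involved.
counts = {key: grp.value_counts().reindex(index=[1, 2, 3])
          for key, grp in df2.groupby('group_col')['value_col']}
pd.concat(counts, names=['group_col'])
Because pd.concat builds the MultiIndex from the dict keys directly, this gives the same shape of output for both df1 and df2.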
Upvotes: 1