Reputation: 4273
I ran into an issue in pandas 0.25.2 when aggregating multiple columns, one of which is categorical.
import pandas as pd
df = pd.DataFrame({
    "col1": [1, 3, 4, 1],
    "col2": pd.Categorical(["b", "a", "c", "b"], categories=["a", "b", "c"], ordered=False),
    "col3": [4, 5, 3, 2]
})
df_agg = df.groupby("col1").agg(
    col2=pd.NamedAgg("col2", "first"),
    col3_max=pd.NamedAgg("col3", "max")
)
print(df_agg)
Output:
col2 col3_max
0 b NaN
1 a 4.0
2 c NaN
3 NaN 5.0
4 NaN 3.0
Expected Output:
col2 col3_max
1 b 4
3 a 5
4 c 3
The issue seems to come from how a categorical column behaves when it is aggregated on its own: the result is a bare Categorical without the group index, not a Series.
df_grouped_col2 = df.groupby("col1")["col2"].first()
print(type(df_grouped_col2))
print(df_grouped_col2)
Output:
<class 'pandas.core.arrays.categorical.Categorical'>
[b, a, c]
Categories (3, object): [a, b, c]
Is this a bug? If so, is there a workaround?
Upvotes: 4
Views: 77
Reputation: 862731
I think it is a bug, but a possible workaround is to use a lambda function with iat
to take the first value of each group:
df_agg = df.groupby("col1").agg(
    col2=pd.NamedAgg("col2", lambda x: x.iat[0]),
    col3_max=pd.NamedAgg("col3", "max")
)
print(df_agg)
col2 col3_max
col1
1 b 4
3 a 5
4 c 3
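Another option, if you would rather keep the built-in "first", might be to cast the categorical column to object before grouping. This is only a sketch: it assumes the problem is limited to the categorical code path, and I have not checked it against every pandas version. The name df_obj is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 3, 4, 1],
    "col2": pd.Categorical(["b", "a", "c", "b"], categories=["a", "b", "c"], ordered=False),
    "col3": [4, 5, 3, 2],
})

# Cast the categorical column to plain object dtype before grouping,
# so "first" does not go through the categorical-specific path.
df_obj = df.assign(col2=df["col2"].astype(object))

df_agg = df_obj.groupby("col1").agg(
    col2=pd.NamedAgg("col2", "first"),
    col3_max=pd.NamedAgg("col3", "max"),
)
print(df_agg)
```

The trade-off is that col2 in the result is no longer categorical.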
df_grouped_col2 = df.groupby("col1")["col2"].agg(lambda x: x.iat[0])
print(type(df_grouped_col2))
<class 'pandas.core.series.Series'>
print(df_grouped_col2)
col1
1 b
3 a
4 c
Name: col2, dtype: object
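Note that the lambda result above comes back as dtype object. If you need the original categories afterwards, one way (a minimal sketch, assuming the aggregated values are all among the original categories) is to cast the column back with the dtype of the source column:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 3, 4, 1],
    "col2": pd.Categorical(["b", "a", "c", "b"], categories=["a", "b", "c"], ordered=False),
    "col3": [4, 5, 3, 2],
})

df_agg = df.groupby("col1").agg(
    col2=pd.NamedAgg("col2", lambda x: x.iat[0]),
    col3_max=pd.NamedAgg("col3", "max"),
)

# Reuse the CategoricalDtype of the source column so the categories
# (and the ordered flag) match the original data.
df_agg["col2"] = df_agg["col2"].astype(df["col2"].dtype)

print(df_agg.dtypes)
```

Reusing df["col2"].dtype keeps the full category list ["a", "b", "c"] even if some categories do not appear in the aggregated result.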
Upvotes: 4