GZ0

Reputation: 4273

Issue when aggregating a categorical column

I encountered an issue in pandas 0.25.2 while aggregating multiple columns that include a categorical column.

import pandas as pd

df = pd.DataFrame({
    "col1": [1, 3, 4, 1], 
    "col2": pd.Categorical(["b", "a", "c", "b"], categories=["a", "b", "c"], ordered=False), 
    "col3": [4, 5, 3, 2]
})
df_agg = df.groupby("col1").agg(
    col2=pd.NamedAgg("col2", "first"),
    col3_max=pd.NamedAgg("col3", "max")
)
print(df_agg)

Output:

  col2  col3_max
0    b       NaN
1    a       4.0
2    c       NaN
3  NaN       5.0
4  NaN       3.0

Expected Output:

  col2  col3_max
1    b       4
3    a       5
4    c       3

The issue seems to be caused by the following behaviour when aggregating a categorical column on its own: first() returns a bare Categorical instead of a Series indexed by the group keys.

df_grouped_col2 = df.groupby("col1")["col2"].first()
print(type(df_grouped_col2))
print(df_grouped_col2)

Output:

<class 'pandas.core.arrays.categorical.Categorical'>
[b, a, c]
Categories (3, object): [a, b, c]

Is this a bug? If so, is there a workaround?

Upvotes: 4

Views: 77

Answers (1)

jezrael

Reputation: 862731

I think it is a bug, but a possible workaround is to use a lambda function with iat to take the first value of each group:

df_agg = df.groupby("col1").agg(
    col2=pd.NamedAgg("col2", lambda x: x.iat[0]),
    col3_max=pd.NamedAgg("col3", "max")
)
print(df_agg)
     col2  col3_max
col1               
1       b         4
3       a         5
4       c         3

df_grouped_col2 = df.groupby("col1")["col2"].agg(lambda x: x.iat[0])
print(type(df_grouped_col2))
<class 'pandas.core.series.Series'>

print(df_grouped_col2)
col1
1    b
3    a
4    c
Name: col2, dtype: object
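
Note that the lambda result above has dtype object, so the categorical dtype is lost. If it needs to be kept, another option might be to cast the column to object before grouping and restore the original dtype afterwards. This is a minimal sketch, assuming the problem only affects categorical inputs to the built-in "first" aggregation; the names cat_dtype and df_agg_cat are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 3, 4, 1],
    "col2": pd.Categorical(["b", "a", "c", "b"], categories=["a", "b", "c"], ordered=False),
    "col3": [4, 5, 3, 2],
})

# Remember the original categorical dtype so it can be restored later.
cat_dtype = df["col2"].dtype

# Aggregate on a plain object column to sidestep the categorical code path.
df_agg_cat = (
    df.astype({"col2": object})
      .groupby("col1")
      .agg(
          col2=pd.NamedAgg(column="col2", aggfunc="first"),
          col3_max=pd.NamedAgg(column="col3", aggfunc="max"),
      )
)

# Cast the aggregated column back to the original categories.
df_agg_cat["col2"] = df_agg_cat["col2"].astype(cat_dtype)
print(df_agg_cat)
print(df_agg_cat.dtypes)
```

The extra casts cost a copy of the column, so the lambda approach is probably simpler when the dtype of the result does not matter.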

Upvotes: 4
