Lisa

Reputation: 4416

pandas groupby: missing group key?

In the DataFrame "data_to_rank", I have a column "r_DTS". data_to_rank['r_DTS'] shows:

Name: r_DTS, dtype: category
Categories (4, object): [Bottom < 2 < Top < Missing]

When I do:

>>> b = data_to_rank.groupby(['r_DTS'])
>>> for key, group in b: print(key)
Bottom
2
Top
Missing

However, when I group by 'r_DTS' together with another variable, the "Missing" category in "r_DTS" disappears.

>>> a = data_to_rank.groupby(['GRADE','r_DTS'])
>>> for key, group in a: print(key)
('HY', 'Bottom')
('HY', '2')
('HY', 'Top')
('IG', 'Bottom')
('IG', '2')
('IG', 'Top')

Where are ('HY', 'Missing') and ('IG', 'Missing')?

Upvotes: 1

Views: 423

Answers (1)

piRSquared

Reputation: 294218

When you group by a single categorical column, pandas includes every category in the grouping, even the ones with no rows.

When you group by multiple columns, you don't get the same behavior, even if all of them are categorical dtypes.

To get it, you have to construct your own categorical to group by. Here is one way to do that:

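import pandas as pd

# Every (GRADE, r_DTS) combination, including those with no rows
cats = pd.MultiIndex.from_product([
        data_to_rank.GRADE.cat.categories,
        data_to_rank.r_DTS.cat.categories,
    ]).tolist()

# One (GRADE, r_DTS) tuple per row, as a categorical over all combinations
categorical_to_group_by = pd.Categorical(
    data_to_rank[['GRADE', 'r_DTS']].apply(tuple, axis=1), cats
)

g = data_to_rank.groupby(categorical_to_group_by)

for name, group in g:
    print(name)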

('HY', 'Bottom')
('HY', 2)
('HY', 'Top')
('HY', 'Missing')
('IG', 'Bottom')
('IG', 2)
('IG', 'Top')
('IG', 'Missing')
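
As a possible alternative that is not part of the approach above: pandas 0.23 and later accept an `observed` argument on `groupby`. The sketch below assumes both grouping columns are categorical and uses hypothetical toy data in place of `data_to_rank`. Because the default for `observed` has varied across pandas versions, it is passed explicitly.

```python
import pandas as pd

# Hypothetical toy data standing in for data_to_rank.
# No row has r_DTS == "Missing", but "Missing" is a declared category.
levels = ["Bottom", "2", "Top", "Missing"]
data_to_rank = pd.DataFrame({
    "GRADE": pd.Categorical(["HY", "HY", "HY", "IG", "IG", "IG"]),
    "r_DTS": pd.Categorical(
        ["Bottom", "2", "Top", "Bottom", "2", "Top"],
        categories=levels,
        ordered=True,
    ),
})

# observed=False asks for every combination of the categorical groupers,
# whether or not that combination occurs in the data.
counts = data_to_rank.groupby(["GRADE", "r_DTS"], observed=False).size()
print(counts)
```

With this toy data, the result should have one row per (GRADE, r_DTS) combination, and the ('HY', 'Missing') and ('IG', 'Missing') rows should show a count of 0. Aggregations such as `size()` are used here rather than iteration, since how empty groups behave when iterating may differ between pandas versions.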

Upvotes: 1
