Bejita

Reputation: 53

Pandas groupby: return a 0 count for groups that do not exist

I have code like this:

frame[frame['value_text'].str.match('Type 2')  | frame['value_text'].str.match('Type II diabetes')].groupby(['value_text','gender'])['value_text'].count()

which returns a Series like:

value_text            gender      count
type 2                  M           4
type 2 without...       M           4
                        F           3

What I want is:

 value_text               gender      count
    type 2                  M           4
                            F           0
    type 2 without...       M           4
                            F           3

I want to include a count for every gender, even when there is no matching record in the DataFrame. How can I do this?

Upvotes: 3

Views: 3600

Answers (4)

jpp

Reputation: 164773

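Categorical data was introduced in pandas specifically for this purpose.

In effect, a groupby on a categorical key calculates the Cartesian product of the keys, including categories with no rows, as long as observed=False is in effect. That was historically the default, but newer pandas versions warn about or change it, so it is safest to pass observed=False explicitly.

You should also see additional benefits compared to the other approaches: lower memory usage and data validation.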

import pandas as pd

df = pd.DataFrame({'value_text': ['type2', 'type2 without', 'type2'],
                   'gender': ['M', 'F', 'M'],
                   'value': [1, 2, 3]})

df['gender'] = df['gender'].astype('category')

# observed=False keeps category combinations that have no rows
res = df.groupby(['value_text', 'gender'], observed=False).count()\
        .fillna(0).astype(int)\
        .reset_index()

print(res)

      value_text gender  value
0          type2      F      0
1          type2      M      2
2  type2 without      F      1
3  type2 without      M      0
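
The Cartesian product also covers categories that never occur in the data, provided they are declared up front. A minimal sketch, assuming a variant of the sample frame above in which every row is 'M', and an explicitly declared category list ['F', 'M'] (both are illustrative assumptions, not the question's data):

```python
import pandas as pd

# Hypothetical variant of the sample frame: 'F' never appears in the data.
df = pd.DataFrame({'value_text': ['type2', 'type2 without', 'type2'],
                   'gender': ['M', 'M', 'M'],
                   'value': [1, 2, 3]})

# Declare the full set of categories explicitly instead of inferring it.
gender_dtype = pd.CategoricalDtype(categories=['F', 'M'])
df['gender'] = df['gender'].astype(gender_dtype)

# observed=False asks groupby to keep categories that have no rows.
# fillna(0) guards against a pandas version reporting such groups as missing.
res = (df.groupby(['value_text', 'gender'], observed=False)['value']
         .count()
         .fillna(0)
         .astype(int))

print(res)
```

Here 'F' should still get a row with a count of 0 under each value_text, even though no 'F' record exists.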

Upvotes: 4

jxc

Reputation: 13998

Remember: pivot, crosstab, stack, and unstack are not reliable when you need to force a specific list of values onto the index or shape of your data, because their output depends entirely on the input. For example, if 'M' never appears in any input row, you will not see 'M' no matter how you pivot or unstack your result. This kind of problem is where reindex() shines.

Assume your grouped result (the groupby count, indexed by value_text and gender) is saved as df:

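  # every level-0 value crossed with an explicit gender list
  mdx1 = pd.MultiIndex.from_product([df.index.levels[0], ['M', 'F']])
  # fill_value=0 replaces fillna(0, downcast='infer'); downcast is deprecated in recent pandas
  df.reindex(mdx1, fill_value=0)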

If you instead just want every level-1 value that occurs in the data to be shown under every level-0 value, do the following:

  # Cartesian product of the values already present in each index level
  mdx1 = pd.MultiIndex.from_product(df.index.levels)
  df.reindex(mdx1, fill_value=0)

This can be easily extended to dataframes with more than 2-level indexes.
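
For reference, the reindex approach can be sketched end to end as follows. The sample rows are hypothetical (only the column names come from the question), and size() is used here as a stand-in for the question's count:

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
frame = pd.DataFrame({
    'value_text': ['type 2', 'type 2', 'type 2 without', 'type 2 without'],
    'gender': ['M', 'M', 'M', 'F'],
})

# Number of rows per (value_text, gender) pair that occurs in the data.
counts = frame.groupby(['value_text', 'gender']).size()

# Full grid: every observed value_text crossed with an explicit gender list.
full_index = pd.MultiIndex.from_product(
    [counts.index.levels[0], ['M', 'F']],
    names=counts.index.names,
)

# fill_value=0 fills the missing combinations and keeps the integer dtype.
result = counts.reindex(full_index, fill_value=0)

print(result)
```

Because the gender list is spelled out, a gender with no rows at all would still appear in the result with a count of 0.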

Update: using the Categorical dtype might fix the problems that pivot-like functions have.

Upvotes: 1

cmaher

Reputation: 5215

The simplest way to do this is with pd.crosstab and then stack:

# save your filtered dataframe as an intermediate result, for convenience
type2 = frame[frame.value_text.str.match('Type 2|Type II diabetes')]

pd.crosstab(type2.value_text, type2.gender).stack()

Upvotes: 0

Peter Leimbigler

Reputation: 11105

Try appending .unstack().fillna(0).stack() to your current line, like so:

frame[frame['value_text'].str.match('Type 2')  |
      frame['value_text'].str.match('Type II diabetes')]\
.groupby(['value_text','gender'])['value_text'].count()\
.unstack().fillna(0).stack()
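
One detail worth knowing: unstack() introduces NaN for the missing combinations, which turns integer counts into floats, so fillna(0) leaves values such as 4.0 and 0.0. A sketch of a variant that passes fill_value=0 to unstack() instead, using hypothetical sample data (only the column names come from the question) and size() as a stand-in for the count:

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
frame = pd.DataFrame({
    'value_text': ['type 2', 'type 2', 'type 2 without', 'type 2 without'],
    'gender': ['M', 'M', 'M', 'F'],
})

counts = frame.groupby(['value_text', 'gender']).size()

# Original chain: the NaN introduced by unstack() upcasts the result to float.
as_float = counts.unstack().fillna(0).stack()

# Variant: fill the gaps during unstack() so the counts stay integers.
as_int = counts.unstack(fill_value=0).stack()

print(as_int)
```

Either form only fills in genders that appear somewhere in the filtered data; a gender with no rows at all will not show up.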

Upvotes: 1
