Reputation: 53
I have code like this:
frame[frame['value_text'].str.match('Type 2') | frame['value_text'].str.match('Type II diabetes')].groupby(['value_text','gender'])['value_text'].count()
which returns a series like
value_text         gender  count
type 2             M       4
type 2 without...  M       4
                   F       3
What I want is:
value_text         gender  count
type 2             M       4
                   F       0
type 2 without...  M       4
                   F       3
I want to include a count for every gender, even when the dataframe has no matching record. How can I do this?
Upvotes: 3
Views: 3600
Reputation: 164773
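Categorical data was introduced in pandas specifically for this purpose.
In effect, groupby operations on categorical columns compute the Cartesian product of the categories, so combinations that never occur in the data still appear in the result. Depending on your pandas version, you may need to pass observed=False to groupby to get this behaviour.
You should also see additional benefits compared to other approaches: lower memory usage and data validation.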
import pandas as pd
df = pd.DataFrame({'value_text': ['type2', 'type2 without', 'type2'],
                   'gender': ['M', 'F', 'M'],
                   'value': [1, 2, 3]})
df['gender'] = df['gender'].astype('category')
# observed=False keeps category combinations that have no rows
res = df.groupby(['value_text', 'gender'], observed=False).count()\
        .fillna(0).astype(int)\
        .reset_index()
print(res)
      value_text gender  value
0          type2      F      0
1          type2      M      2
2  type2 without      F      1
3  type2 without      M      0
Upvotes: 4
Reputation: 13998
Remember: whenever you want to force a specific list of values onto the index or shape of your data, pivot, crosstab, stack, and unstack are not reliable, since they depend heavily on the input data. For example, if 'M' never appears in any input row, you will not see 'M' no matter how you pivot or unstack your result. This kind of problem is where reindex() shines.
Assume your pre-processed result (the grouped counts, with a two-level index) is saved as df:
mdx1 = pd.MultiIndex.from_product([df.index.levels[0], ['M', 'F']])
# fill_value=0 fills the newly added index entries with zero
df.reindex(mdx1, fill_value=0)
On the other hand, if you just want every level-1 value that occurs anywhere to be shown under every level-0 value, do the following:
mdx1 = pd.MultiIndex.from_product(df.index.levels)
df.reindex(mdx1, fill_value=0)
This can be easily extended to dataframes with more than 2-level indexes.
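To make the reindex approach concrete, here is a minimal sketch on made-up data. The sample frame and the names counts, full_index and result are assumptions for illustration only; the column names follow the question.

```python
import pandas as pd

# Hypothetical sample data; 'type 2' has no 'F' rows.
frame = pd.DataFrame({
    'value_text': ['type 2', 'type 2', 'type 2 without',
                   'type 2 without', 'type 2 without'],
    'gender': ['M', 'M', 'M', 'F', 'F'],
})

# Grouped counts, as in the question: a Series with a two-level index.
counts = frame.groupby(['value_text', 'gender'])['value_text'].count()

# Build every (value_text, gender) pair, forcing both genders explicitly.
full_index = pd.MultiIndex.from_product(
    [counts.index.levels[0], ['M', 'F']],
    names=counts.index.names,
)

# Pairs missing from counts are added with a count of 0.
result = counts.reindex(full_index, fill_value=0)
print(result)
```

Because the gender list is written out by hand, both 'M' and 'F' appear for every value_text even if one of them is absent from the data entirely.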
Update: using the Categorical data type might fix the problems that pivot-like functions have.
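As a rough illustration of that update, the sketch below declares the gender categories explicitly, so a gender that never occurs in the data is still counted. The sample frame is made up, and passing observed=False is my assumption about what is needed on recent pandas versions to keep unobserved categories.

```python
import pandas as pd

# Hypothetical sample data with no 'F' rows at all.
frame = pd.DataFrame({
    'value_text': ['type 2', 'type 2', 'type 2 without'],
    'gender': ['M', 'M', 'M'],
})

# Declare the full set of allowed genders, including the unseen 'F'.
frame['gender'] = pd.Categorical(frame['gender'], categories=['M', 'F'])

# observed=False keeps categories that have no rows; size() reports 0 for them.
counts = frame.groupby(['value_text', 'gender'], observed=False).size()
print(counts)
```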
Upvotes: 1
Reputation: 5215
The simplest way to do this is with pd.crosstab followed by stack:
# save your filtered dataframe as an intermediate result, for convenience
type2 = frame[frame.value_text.str.match('Type 2|Type II diabetes')]
pd.crosstab(type2.value_text, type2.gender).stack()
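For a quick check of what this produces, here is a small self-contained sketch. The sample frame is invented for illustration, and the names table and counts are not from the original post.

```python
import pandas as pd

# Hypothetical sample data; 'Type 2' has no 'F' rows.
frame = pd.DataFrame({
    'value_text': ['Type 2', 'Type 2', 'Type 2 without',
                   'Type 2 without', 'asthma'],
    'gender': ['M', 'M', 'M', 'F', 'F'],
})

# Keep only the rows of interest, as in the answer.
type2 = frame[frame['value_text'].str.match('Type 2|Type II diabetes')]

# crosstab builds a value_text x gender table and puts 0 in empty cells.
table = pd.crosstab(type2['value_text'], type2['gender'])

# stack turns the table back into a Series indexed by (value_text, gender).
counts = table.stack()
print(counts)
```

Note that crosstab only creates columns for genders that occur somewhere in the filtered data, so a gender with no rows at all would still be missing.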
Upvotes: 0
Reputation: 11105
Try appending .unstack().fillna(0).stack() to your current line, like so:
frame[frame['value_text'].str.match('Type 2') |
frame['value_text'].str.match('Type II diabetes')]\
.groupby(['value_text','gender'])['value_text'].count()\
.unstack().fillna(0).stack()
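Here is a minimal sketch of that chain on made-up data. The sample frame and the names counts, filled and alt are assumptions for illustration. Since fillna(0) leaves float values, the sketch casts back to int; it also shows unstack's fill_value argument as a possible shortcut.

```python
import pandas as pd

# Hypothetical sample data; 'type 2' has no 'F' rows.
frame = pd.DataFrame({
    'value_text': ['type 2', 'type 2', 'type 2 without',
                   'type 2 without', 'type 2 without'],
    'gender': ['M', 'M', 'M', 'F', 'F'],
})

counts = frame.groupby(['value_text', 'gender'])['value_text'].count()

# unstack moves gender into columns (missing pairs become NaN),
# fillna(0) replaces those NaNs, and stack restores the two-level index.
filled = counts.unstack().fillna(0).astype(int).stack()

# Variant: let unstack fill the gaps itself, which keeps integer counts.
alt = counts.unstack(fill_value=0).stack()
print(filled)
```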
Upvotes: 1