Reputation: 3177
I've successfully created a counting table from the data in df1
df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
'B':['yes','no','no','no','yes','yes','no','yes','yes','no'],
'C':['no','yes','yes','no','yes','no','no','no','yes','no']})
by using the following command:
df1.groupby(['A','B','C']).size().reset_index().rename(columns={0:'count'})
I've got the counting table as follows:
--+--+--+--+--+--+--+--+--+--+--+
index A B C count
0 no no no 1
1 no yes no 1
2 no yes yes 1
3 yes no no 2
4 yes no yes 2
5 yes yes no 2
6 yes yes yes 1
--+--+--+--+--+--+--+--+--+--+--+
However, the possible values of A, B, C are {'yes','no'} which means all possible combinations should be equal to 2*2*2 = 8. I could not get the exact all possible combinations because my data is not covered all cases. The expected output should be the following:
--+--+--+--+--+--+--+--+--+--+--+
index A B C count
0 no no no 1
1 no yes no 1
2 no no yes 0 <-- count = 0 because it does not exist in my data
3 no yes yes 1
4 yes no no 2
5 yes no yes 2
6 yes yes no 2
7 yes yes yes 1
--+--+--+--+--+--+--+--+--+--+--+
Furthermore, I could have missing values in my data. For example, the data with missing values is the following code:
df2 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','*','no'],
'B':['yes','*','no','no','yes','yes','no','yes','yes','no'],
'C':['no','yes','*','no','yes','no','no','no','yes','no']})
So, my data (with missing values) look like this
--+--+--+--+--+--+--+--+
index A B C
0 yes yes no
1 yes * yes
2 yes no *
3 yes no no
4 no yes yes
5 no yes no
6 yes no no
7 yes yes no
8 * yes yes
9 no no no
--+--+--+--+--+--+--+--+
where * represents a missing value. In this case, I still want to create my counting table that contains all possible cases (similar to the 2nd table). In this case, I need to ignore any rows that has at least one missing values (index = 1, 2, and 8). Therefore, the expected output should be like this.
--+--+--+--+--+--+--+--+--+--+--+
index A B C count
0 no no no 1
1 no yes no 1
2 no no yes 0 <-- count = 0 because it does not exist in my data
3 no yes yes 1
4 yes no no 2
5 yes no yes 0 <-- count = 0 because of missing value in row 2
6 yes yes no 2
7 yes yes yes 0 <-- count = 0 because of missing value in row 1 and 8
--+--+--+--+--+--+--+--+--+--+--+
May I have your suggestions of how to overcome with the coverage problem and missing value problem? Thank you in advance.
Upvotes: 1
Views: 101
Reputation: 2137
You can add a reindex after you get the size of your groupby
In [1]: idx = pd.MultiIndex.from_product([['no', 'yes']] * 3, names=['A', 'B', 'C'])
In [2]: print df1.groupby(['A', 'B', 'C']).size().reindex(idx).fillna(0).reset_index().rename(columns={0: 'count'})
A B C count
0 no no no 1
1 no no yes 0
2 no yes no 1
3 no yes yes 1
4 yes no no 2
5 yes no yes 2
6 yes yes no 2
7 yes yes yes 1
In [3]: print df2.groupby(['A', 'B', 'C']).size().reindex(idx).fillna(0).reset_index().rename(columns={0: 'count'})
A B C count
0 no no no 1
1 no no yes 0
2 no yes no 1
3 no yes yes 1
4 yes no no 2
5 yes no yes 0
6 yes yes no 2
7 yes yes yes 0
Upvotes: 2
Reputation: 16251
Let me answer your coverage problem first.
The first step is to generate a dataframe with all the possibilities. To do so, I created a MultiIndex from a Cartesian product, and reset the index (like you did):
iterables = [['yes', 'no'], ['yes', 'no'], ['yes', 'no']]
df_all = pd.DataFrame(
{'count': 0},
index=pd.MultiIndex.from_product(iterables, names=['A', 'B', 'C'])
).reset_index()
The second step uses combine_first
to combine the result with your grouped dataframe (which I call df
below). The problem is that combine_first
works by index, so I had to reindex both dataframes first.
The easiest (if not most elegant) solution I found was to set as index the concatenation of columns A
, B
, and C
:
df = df1.groupby(['A','B','C']).size().reset_index().rename(columns={0:'count'})
df.set_index(df['A'] + df['B'] + df['C'], inplace=True)
Now we're ready to use combine_first
, dropping the temporary index at the end:
df_all.set_index(df_all['A'] + df_all['B'] + df_all['C'], inplace=True)
df.combine_first(df_all).reset_index(drop=True)
The result:
A B C count
0 no no no 1
1 no no yes 0
2 no yes no 1
3 no yes yes 1
4 yes no no 2
5 yes no yes 2
6 yes yes no 2
7 yes yes yes 1
Upvotes: 0