Creating a counting table from data that is not cover all possible combinations and missing value is available in pandas data frame

Question

I've successfully created a counting table from the data in df1

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
'B':['yes','no','no','no','yes','yes','no','yes','yes','no'],
'C':['no','yes','yes','no','yes','no','no','no','yes','no']})

by using the following command:

df1.groupby(['A','B','C']).size().reset_index().rename(columns={0:'count'})

I've got the counting table as follows:

--+--+--+--+--+--+--+--+--+--+--+
index    A     B    C    count
0       no    no   no        1
1       no   yes   no        1
2       no   yes  yes        1
3      yes    no   no        2
4      yes    no  yes        2
5      yes   yes   no        2
6      yes   yes  yes        1
--+--+--+--+--+--+--+--+--+--+--+

However, the possible values of A, B, C are {'yes','no'} which means all possible combinations should be equal to 2*2*2 = 8. I could not get the exact all possible combinations because my data is not covered all cases. The expected output should be the following:

--+--+--+--+--+--+--+--+--+--+--+
index    A     B    C    count
0       no    no   no        1
1       no   yes   no        1
2       no    no  yes        0    <-- count = 0 because it does not exist in my data
3       no   yes  yes        1
4      yes    no   no        2
5      yes    no  yes        2
6      yes   yes   no        2
7      yes   yes  yes        1
--+--+--+--+--+--+--+--+--+--+--+

Furthermore, I could have missing values in my data. For example, the data with missing values is the following code:

df2 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','*','no'],
'B':['yes','*','no','no','yes','yes','no','yes','yes','no'],
'C':['no','yes','*','no','yes','no','no','no','yes','no']})

So, my data (with missing values) look like this

--+--+--+--+--+--+--+--+
index    A     B    C
0      yes   yes   no
1      yes     *  yes
2      yes    no    *
3      yes    no   no
4       no   yes  yes
5       no   yes   no
6      yes    no   no
7      yes   yes   no
8        *   yes  yes
9       no    no   no
--+--+--+--+--+--+--+--+

where * represents a missing value. In this case, I still want to create my counting table that contains all possible cases (similar to the 2nd table). In this case, I need to ignore any rows that has at least one missing values (index = 1, 2, and 8). Therefore, the expected output should be like this.

--+--+--+--+--+--+--+--+--+--+--+
index    A     B    C    count
0       no    no   no        1
1       no   yes   no        1
2       no    no  yes        0   <-- count = 0 because it does not exist in my data
3       no   yes  yes        1
4      yes    no   no        2
5      yes    no  yes        0   <-- count = 0 because of missing value in row 2
6      yes   yes   no        2
7      yes   yes  yes        0   <-- count = 0 because of missing value in row 1 and 8
--+--+--+--+--+--+--+--+--+--+--+

May I have your suggestions of how to overcome with the coverage problem and missing value problem? Thank you in advance.

Colin · Accepted Answer

You can add a reindex after you get the size of your groupby

In [1]: idx = pd.MultiIndex.from_product([['no', 'yes']] * 3, names=['A', 'B', 'C'])

In [2]: print df1.groupby(['A', 'B', 'C']).size().reindex(idx).fillna(0).reset_index().rename(columns={0: 'count'})
     A    B    C  count
0   no   no   no      1
1   no   no  yes      0
2   no  yes   no      1
3   no  yes  yes      1
4  yes   no   no      2
5  yes   no  yes      2
6  yes  yes   no      2
7  yes  yes  yes      1

In [3]: print df2.groupby(['A', 'B', 'C']).size().reindex(idx).fillna(0).reset_index().rename(columns={0: 'count'})
     A    B    C  count
0   no   no   no      1
1   no   no  yes      0
2   no  yes   no      1
3   no  yes  yes      1
4  yes   no   no      2
5  yes   no  yes      0
6  yes  yes   no      2
7  yes  yes  yes      0

Creating a counting table from data that is not cover all possible combinations and missing value is available in pandas data frame

Answers (2)

Related Questions