Creating a counting table from data that is not cover all possible combinations and missing value is available in pandas data frame

I've successfully created a counting table from the data in df1

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
'B':['yes','no','no','no','yes','yes','no','yes','yes','no'],
'C':['no','yes','yes','no','yes','no','no','no','yes','no']})

by using the following command:

df1.groupby(['A','B','C']).size().reset_index().rename(columns={0:'count'})

I've got the counting table as follows:

--+--+--+--+--+--+--+--+--+--+--+
index    A     B    C    count
0       no    no   no        1
1       no   yes   no        1
2       no   yes  yes        1
3      yes    no   no        2
4      yes    no  yes        2
5      yes   yes   no        2
6      yes   yes  yes        1
--+--+--+--+--+--+--+--+--+--+--+

However, the possible values of A, B, C are {'yes','no'} which means all possible combinations should be equal to 2*2*2 = 8. I could not get the exact all possible combinations because my data is not covered all cases. The expected output should be the following:

--+--+--+--+--+--+--+--+--+--+--+
index    A     B    C    count
0       no    no   no        1
1       no   yes   no        1
2       no    no  yes        0    <-- count = 0 because it does not exist in my data
3       no   yes  yes        1
4      yes    no   no        2
5      yes    no  yes        2
6      yes   yes   no        2
7      yes   yes  yes        1
--+--+--+--+--+--+--+--+--+--+--+

Furthermore, I could have missing values in my data. For example, the data with missing values is the following code:

df2 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','*','no'],
'B':['yes','*','no','no','yes','yes','no','yes','yes','no'],
'C':['no','yes','*','no','yes','no','no','no','yes','no']})

So, my data (with missing values) look like this

--+--+--+--+--+--+--+--+
index    A     B    C
0      yes   yes   no
1      yes     *  yes
2      yes    no    *
3      yes    no   no
4       no   yes  yes
5       no   yes   no
6      yes    no   no
7      yes   yes   no
8        *   yes  yes
9       no    no   no
--+--+--+--+--+--+--+--+

where * represents a missing value. In this case, I still want to create my counting table that contains all possible cases (similar to the 2nd table). In this case, I need to ignore any rows that has at least one missing values (index = 1, 2, and 8). Therefore, the expected output should be like this.

--+--+--+--+--+--+--+--+--+--+--+
index    A     B    C    count
0       no    no   no        1
1       no   yes   no        1
2       no    no  yes        0   <-- count = 0 because it does not exist in my data
3       no   yes  yes        1
4      yes    no   no        2
5      yes    no  yes        0   <-- count = 0 because of missing value in row 2
6      yes   yes   no        2
7      yes   yes  yes        0   <-- count = 0 because of missing value in row 1 and 8
--+--+--+--+--+--+--+--+--+--+--+

May I have your suggestions of how to overcome with the coverage problem and missing value problem? Thank you in advance.

Upvotes: 1

Views: 101

Answers (2)

Colin
Colin

Reputation: 2137

You can add a reindex after you get the size of your groupby

In [1]: idx = pd.MultiIndex.from_product([['no', 'yes']] * 3, names=['A', 'B', 'C'])

In [2]: print df1.groupby(['A', 'B', 'C']).size().reindex(idx).fillna(0).reset_index().rename(columns={0: 'count'})
     A    B    C  count
0   no   no   no      1
1   no   no  yes      0
2   no  yes   no      1
3   no  yes  yes      1
4  yes   no   no      2
5  yes   no  yes      2
6  yes  yes   no      2
7  yes  yes  yes      1

In [3]: print df2.groupby(['A', 'B', 'C']).size().reindex(idx).fillna(0).reset_index().rename(columns={0: 'count'})
     A    B    C  count
0   no   no   no      1
1   no   no  yes      0
2   no  yes   no      1
3   no  yes  yes      1
4  yes   no   no      2
5  yes   no  yes      0
6  yes  yes   no      2
7  yes  yes  yes      0

Upvotes: 2

IanS
IanS

Reputation: 16251

Let me answer your coverage problem first.

The first step is to generate a dataframe with all the possibilities. To do so, I created a MultiIndex from a Cartesian product, and reset the index (like you did):

iterables = [['yes', 'no'], ['yes', 'no'], ['yes', 'no']]
df_all = pd.DataFrame(
        {'count': 0}, 
        index=pd.MultiIndex.from_product(iterables, names=['A', 'B', 'C'])
    ).reset_index()

The second step uses combine_first to combine the result with your grouped dataframe (which I call df below). The problem is that combine_first works by index, so I had to reindex both dataframes first.

The easiest (if not most elegant) solution I found was to set as index the concatenation of columns A, B, and C:

df = df1.groupby(['A','B','C']).size().reset_index().rename(columns={0:'count'})
df.set_index(df['A'] + df['B'] + df['C'], inplace=True)

Now we're ready to use combine_first, dropping the temporary index at the end:

df_all.set_index(df_all['A'] + df_all['B'] + df_all['C'], inplace=True)
df.combine_first(df_all).reset_index(drop=True)

The result:

     A    B    C  count
0   no   no   no      1
1   no   no  yes      0
2   no  yes   no      1
3   no  yes  yes      1
4  yes   no   no      2
5  yes   no  yes      2
6  yes  yes   no      2
7  yes  yes  yes      1

Upvotes: 0

Related Questions