Generating Count Table in Python pandas

Question

I am trying to convert a pandas dataframe into one of cumulative counts/percentages. My current dataframe looks like this:

Name    Purchase
alice   apple
bob     orange
dave    orange
bob     apple
bob     apple
alice   apple

And I'd like to turn this table into one that shows the counts as follows:

Name    tot-purchases    num-apple    percent-apple
alice   2               2            100
bob     3               2            66.67
dave    1               0            0

I know I can use df['Name'].value_counts() to get the 'tot-purchases', but I can't figure out how to use 'Name' as a key for the other two columns. I'm just lacking a basic understanding of what groupby and the selection [] operator does on dataframes. For instance, I would have thought that this would give me a DataFrame of just instances with apples, but that's giving me an invalid type comparison:

df[df['Purchase'] == 'apple']['Name']]

TheBlackCat · Accepted Answer

Something like this should work. It will be faster than apply since it uses vectorized operations. Further, rather than hard-coding the apple result, it gives you counts and percentages for all purchases, no matter how many you might have.

>>> df2 = df.reset_index().groupby(['Name', 'Purchase']).count().unstack('Purchase').fillna(0)
>>> df2.columns = df2.columns.droplevel(0)
>>>
>>> totals = df2.sum(axis=1)
>>> totals.name = 'tot-purchases'
>>>
>>> df3 = df2.divide(df2.sum(axis=1), axis=0)
>>>
>>> df2.columns = ['num-'+x for x in df2.columns]
>>> df3.columns = ['percent-'+x for x in df3.columns]
>>> dff = pd.concat([totals, df2, df3], axis=1)
>>>
>>> print(dff)
       tot-purchases  num-apple  num-orange  percent-apple  percent-orange
Name                                                                      
alice              2          2           0       1.000000        0.000000
bob                3          2           1       0.666667        0.333333
dave               1          0           1       0.000000        1.000000
>>> print(dff.loc[:,('tot-purchases', 'num-apple', 'percent-apple')])
       tot-purchases  num-apple  percent-apple
Name                                          
alice              2          2       1.000000
bob                3          2       0.666667
dave               1          0       0.000000

Basically it divides the data up into groups by Name and Purchase, then counts how many are in each group. Then it sets the Purchase name to be a column heading, giving you a 2D DataFrame where the index is Name, the columns are the Purchase type, and the values are the counts of that Purchase type for the given Name. Then it is a matter of arithmetic to get percentages and totals.

If you are willing to change the output a bit, you can do something even more useful with a MultiIndex:

>>> df2 = df.reset_index().groupby(['Name', 'Purchase']).count().unstack('Purchase').fillna(0)
>>> df2.columns.rename('Value',level=0, inplace=True)
>>> df2.columns = df2.columns.set_levels(['Count'], level=0)
>>>
>>> totals = df2.sum(axis=1)
>>> totals.name = ('Count', 'all')
>>>
>>> df3 = df2.divide(df2.sum(axis=1), axis=0)
>>> df3.columns = df3.columns.set_levels(['Percent'], level=0)
>>>
>>> dff = pd.concat([totals, df2, df3], axis=1)
>>>
>>> print(dff)
      Count                Percent          
        all apple orange     apple    orange
Name                                        
alice     2     2      0  1.000000  0.000000
bob       3     2      1  0.666667  0.333333
dave      1     0      1  0.000000  1.000000

Generating Count Table in Python pandas

Answers (2)

Related Questions