Guerre
Guerre

Reputation: 45

Generating Count Table in Python pandas

I am trying to convert a pandas dataframe into one of cumulative counts/percentages. My current dataframe looks like this:

Name    Purchase
alice   apple
bob     orange
dave    orange
bob     apple
bob     apple
alice   apple

And I'd like to turn this table into one that shows the counts as follows:

Name    tot-purchases    num-apple    percent-apple
alice   2               2            100
bob     3               2            66.67
dave    1               0            0

I know I can use df['Name'].value_counts() to get the 'tot-purchases', but I can't figure out how to use 'Name' as a key for the other two columns. I'm just lacking a basic understanding of what groupby and the selection [] operator does on dataframes. For instance, I would have thought that this would give me a DataFrame of just instances with apples, but that's giving me an invalid type comparison:

df[df['Purchase'] == 'apple']['Name']]

Upvotes: 1

Views: 8304

Answers (2)

kennes
kennes

Reputation: 2145

I wrote a little function to do this.

Pass a data frame to df and the column that's being operated on to column.

def fruits(df,column): # column needs to be string

        df['tot-purchases'] = 1
        for item in df[column].unique():
                df['num-%s' % item] = df[column].apply(lambda value: 1 if value == str(item) else 0)

        new_data = data.groupby('Name').sum()
        cols = [col for col in new_data.columns if 'num' in col]
        for col in cols:
                new_data[col.replace('num','percent')] = new_data[col] / new_data['tot-purchases'] * 100


        return new_data

Output:

In [73]: data
Out[73]: 
    Name Purchase
0  alice    apple
1    bob   orange
2   dave   orange
3    bob    apple
4    bob    apple
5  alice    apple

In [74]: print fruits(data, 'Purchase')
       tot-purchases  num-apple  num-orange  percent-apple  percent-orange
Name                                                                      
alice              2          2           0     100.000000        0.000000
bob                3          2           1      66.666667       33.333333
dave               1          0           1       0.000000      100.000000

Upvotes: 1

TheBlackCat
TheBlackCat

Reputation: 10298

Something like this should work. It will be faster than apply since it uses vectorized operations. Further, rather than hard-coding the apple result, it gives you counts and percentages for all purchases, no matter how many you might have.

>>> df2 = df.reset_index().groupby(['Name', 'Purchase']).count().unstack('Purchase').fillna(0)
>>> df2.columns = df2.columns.droplevel(0)
>>>
>>> totals = df2.sum(axis=1)
>>> totals.name = 'tot-purchases'
>>>
>>> df3 = df2.divide(df2.sum(axis=1), axis=0)
>>>
>>> df2.columns = ['num-'+x for x in df2.columns]
>>> df3.columns = ['percent-'+x for x in df3.columns]
>>> dff = pd.concat([totals, df2, df3], axis=1)
>>>
>>> print(dff)
       tot-purchases  num-apple  num-orange  percent-apple  percent-orange
Name                                                                      
alice              2          2           0       1.000000        0.000000
bob                3          2           1       0.666667        0.333333
dave               1          0           1       0.000000        1.000000
>>> print(dff.loc[:,('tot-purchases', 'num-apple', 'percent-apple')])
       tot-purchases  num-apple  percent-apple
Name                                          
alice              2          2       1.000000
bob                3          2       0.666667
dave               1          0       0.000000

Basically it divides the data up into groups by Name and Purchase, then counts how many are in each group. Then it sets the Purchase name to be a column heading, giving you a 2D DataFrame where the index is Name, the columns are the Purchase type, and the values are the counts of that Purchase type for the given Name. Then it is a matter of arithmetic to get percentages and totals.

If you are willing to change the output a bit, you can do something even more useful with a MultiIndex:

>>> df2 = df.reset_index().groupby(['Name', 'Purchase']).count().unstack('Purchase').fillna(0)
>>> df2.columns.rename('Value',level=0, inplace=True)
>>> df2.columns = df2.columns.set_levels(['Count'], level=0)
>>>
>>> totals = df2.sum(axis=1)
>>> totals.name = ('Count', 'all')
>>>
>>> df3 = df2.divide(df2.sum(axis=1), axis=0)
>>> df3.columns = df3.columns.set_levels(['Percent'], level=0)
>>>
>>> dff = pd.concat([totals, df2, df3], axis=1)
>>>
>>> print(dff)
      Count                Percent          
        all apple orange     apple    orange
Name                                        
alice     2     2      0  1.000000  0.000000
bob       3     2      1  0.666667  0.333333
dave      1     0      1  0.000000  1.000000

Upvotes: 2

Related Questions