Reputation: 45
I am trying to convert a pandas dataframe into one of cumulative counts/percentages. My current dataframe looks like this:
Name Purchase
alice apple
bob orange
dave orange
bob apple
bob apple
alice apple
And I'd like to turn this table into one that shows the counts as follows:
Name tot-purchases num-apple percent-apple
alice 2 2 100
bob 3 2 66.67
dave 1 0 0
I know I can use df['Name'].value_counts() to get the 'tot-purchases', but I can't figure out how to use 'Name' as a key for the other two columns. I'm just lacking a basic understanding of what groupby and the selection [] operator does on dataframes. For instance, I would have thought that this would give me a DataFrame of just instances with apples, but that's giving me an invalid type comparison:
df[df['Purchase'] == 'apple']['Name']]
Upvotes: 1
Views: 8304
Reputation: 2145
I wrote a little function to do this.
Pass a data frame to df and the column that's being operated on to column.
def fruits(df,column): # column needs to be string
df['tot-purchases'] = 1
for item in df[column].unique():
df['num-%s' % item] = df[column].apply(lambda value: 1 if value == str(item) else 0)
new_data = data.groupby('Name').sum()
cols = [col for col in new_data.columns if 'num' in col]
for col in cols:
new_data[col.replace('num','percent')] = new_data[col] / new_data['tot-purchases'] * 100
return new_data
Output:
In [73]: data
Out[73]:
Name Purchase
0 alice apple
1 bob orange
2 dave orange
3 bob apple
4 bob apple
5 alice apple
In [74]: print fruits(data, 'Purchase')
tot-purchases num-apple num-orange percent-apple percent-orange
Name
alice 2 2 0 100.000000 0.000000
bob 3 2 1 66.666667 33.333333
dave 1 0 1 0.000000 100.000000
Upvotes: 1
Reputation: 10298
Something like this should work. It will be faster than apply
since it uses vectorized operations. Further, rather than hard-coding the apple
result, it gives you counts and percentages for all purchases, no matter how many you might have.
>>> df2 = df.reset_index().groupby(['Name', 'Purchase']).count().unstack('Purchase').fillna(0)
>>> df2.columns = df2.columns.droplevel(0)
>>>
>>> totals = df2.sum(axis=1)
>>> totals.name = 'tot-purchases'
>>>
>>> df3 = df2.divide(df2.sum(axis=1), axis=0)
>>>
>>> df2.columns = ['num-'+x for x in df2.columns]
>>> df3.columns = ['percent-'+x for x in df3.columns]
>>> dff = pd.concat([totals, df2, df3], axis=1)
>>>
>>> print(dff)
tot-purchases num-apple num-orange percent-apple percent-orange
Name
alice 2 2 0 1.000000 0.000000
bob 3 2 1 0.666667 0.333333
dave 1 0 1 0.000000 1.000000
>>> print(dff.loc[:,('tot-purchases', 'num-apple', 'percent-apple')])
tot-purchases num-apple percent-apple
Name
alice 2 2 1.000000
bob 3 2 0.666667
dave 1 0 0.000000
Basically it divides the data up into groups by Name
and Purchase
, then counts how many are in each group. Then it sets the Purchase
name to be a column heading, giving you a 2D DataFrame
where the index
is Name
, the columns
are the Purchase
type, and the values are the counts of that Purchase
type for the given Name
. Then it is a matter of arithmetic to get percentages and totals.
If you are willing to change the output a bit, you can do something even more useful with a MultiIndex
:
>>> df2 = df.reset_index().groupby(['Name', 'Purchase']).count().unstack('Purchase').fillna(0)
>>> df2.columns.rename('Value',level=0, inplace=True)
>>> df2.columns = df2.columns.set_levels(['Count'], level=0)
>>>
>>> totals = df2.sum(axis=1)
>>> totals.name = ('Count', 'all')
>>>
>>> df3 = df2.divide(df2.sum(axis=1), axis=0)
>>> df3.columns = df3.columns.set_levels(['Percent'], level=0)
>>>
>>> dff = pd.concat([totals, df2, df3], axis=1)
>>>
>>> print(dff)
Count Percent
all apple orange apple orange
Name
alice 2 2 0 1.000000 0.000000
bob 3 2 1 0.666667 0.333333
dave 1 0 1 0.000000 1.000000
Upvotes: 2