Reputation:
I would like to calculate how often an item appears in a shopping cart.
I have a purchase recognizable by the buyerid
. This buyerid can buy several items (also twice, triple,..., n-th times). Recognizable by itemid
and description
.
I would like to count the number of times an item ends up in a shopping cart. For example, out of 5 purchases, 3 people bought an apple, i.e. 0.6%. I would like to spend this on all products, how do I do that?
import pandas as pd
d = {'buyerid': [0,0,1,2,3,3,3,4,4,4,4],
'itemid': [0,1,2,1,1,1,2,4,5,1,1],
'description': ['Banana', 'Apple', 'Apple', 'Strawberry', 'Apple', 'Banana', 'Apple', 'Dog-Food', 'Beef', 'Banana', 'Apple'], }
df = pd.DataFrame(data=d)
display(df.head(20))
My try:
# How many % of the articels are the same?
# this is wrong... :/
df_grouped = df.groupby('description').count()
display(df_grouped)
df_apple = df_grouped.iloc[0]
percentage = df_apple[0] / df.shape[0]
print(percentage)
[OUT] 0.45454545454545453
The mathematic formula
count of all buys (count_buy ) = 5
count how many an apple appears in the buy (count_apple) = 3
count_buy /count_apple = 3 / 5 = 0.6
What I would like to have (please note, I have not calculated the values, these are just dumy values)
Upvotes: 1
Views: 100
Reputation: 862481
Use GroupBy.size
and divide by count of unique values of buyerid
by Series.nunique
:
print (df.groupby(['itemid','description']).size())
itemid description
0 Banana 1
1 Apple 3
Banana 2
Strawberry 1
2 Apple 2
4 Dog-Food 1
5 Beef 1
dtype: int64
purch = df['buyerid'].nunique()
df1 = df.groupby(['itemid','description']).size().div(purch).reset_index(name='percentage')
print (df1)
itemid description percentage
0 0 Banana 0.2
1 1 Apple 0.6
2 1 Banana 0.4
3 1 Strawberry 0.2
4 2 Apple 0.4
5 4 Dog-Food 0.2
6 5 Beef 0.2
Upvotes: 1
Reputation: 1840
As always, there are multiple ways to the gold, but i would go over pivoting as following: Your input:
import pandas as pd
d = {'buyerid': [0,0,1,2,3,3,3,4,4,4,4],
'itemid': [0,1,2,1,1,1,2,4,5,1,1],
'description': ['Banana', 'Apple', 'Apple', 'Strawberry', 'Apple', 'Banana', 'Apple', 'Dog-Food', 'Beef', 'Banana', 'Apple'], }
df = pd.DataFrame(data=d)
In a next step pivot the data with buyer_id
as index
and description
as columns and replace NA with 0 as such
df2 = df.pivot_table(values='itemid', index='buyerid', columns='description', aggfunc='count')
df2 = df2.fillna(0)
resulting in
description Apple Banana Beef Dog-Food Strawberry
buyerid
0 1.0 1.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 1.0
3 2.0 1.0 0.0 0.0 0.0
4 1.0 1.0 1.0 1.0 0.0
calling the mean on the table:
df_final = df2.mean()
results in
description
Apple 1.0
Banana 0.6
Beef 0.2
Dog-Food 0.2
Strawberry 0.2
dtype: float64
Upvotes: 0
Reputation: 942
I would group it and create a new column as follows:
df_grp = df.groupby('description')['buyerid'].sum().reset_index(name='total')
df_grp['percentage'] = (df_grp.total / df_grp.total.sum()) * 100
df_grp
Result:
description total percentage
0 Apple 11 39.285714
1 Banana 7 25.000000
2 Beef 4 14.285714
3 Dog-Food 4 14.285714
4 Strawberry 2 7.142857
Upvotes: 0