Reputation: 3375
There have been some variations of this question already but I can't find the exact thing I'm looking for.
I have a customer dataframe with the products they've purchased:
customer product
0 John Milk
1 John Milk
2 John Shoes
3 John Shoes
4 John Shoes
5 John Bread
6 Mary Milk
7 Mary Milk
8 Mary Milk
9 Mary Milk
10 Mary Milk
11 Mary Milk
12 Mary Shoes
13 Mary Shoes
14 Joe Bread
15 Joe Bread
16 Joe Bread
17 Joe Bread
18 Joe Milk
19 Joe Milk
20 Joe Milk
21 Joe Fruit
22 Joe Fruit
23 Joe Shoes
24 Joe Shoes
25 Joe Shoes
26 Joe Beer
27 Joe Beer
28 Joe Beer
29 Joe Beer
Note: In the full dataframe the customers have hundreds of products so it can't just be a simple join / set of products in the column.
I would like get the top 5 products ( value_counts()
) for each customer and have the product names in a separate column. Not the count of the products, just the product names (the index of the value couts), and in the correct order of the value counts.
My goal is to have a dataframe like this:
customer Top 5
0 John Shoes Milk Bread
1 Mary Milk Shoes
2 Joe Bread Beer Milk Shoes Fruit
I think I am on the right track with this line, I can see each customers Top 5 products:
newdf.groupby('customer')['product'].value_counts()
customer product
Joe Beer 4
Bread 4
Milk 3
Shoes 3
Fruit 2
John Shoes 3
Milk 2
Bread 1
Mary Milk 6
Shoes 2
I am just not able to extract the information from this dataframe in the format I like. I have tried resetting the index and slicing etc. but I can't seem to get it right.
Upvotes: 2
Views: 930
Reputation: 862611
Use Series.value_counts
in lambda function for join top 5 values of index with GroupBy.agg
:
f = lambda x: ' '.join(x.value_counts().index[:5])
df = (newdf.groupby('customer')['product'].agg(f)
.reset_index(name='Top 5'))
print (df)
customer Top 5
0 Joe Bread Beer Shoes Milk Fruit
1 John Shoes Milk Bread
2 Mary Milk Shoes
If ordering of customer is important:
f = lambda x: ' '.join(x.value_counts().index[:5])
df = (newdf.groupby('customer', sort=False)['product'].agg(f)
.reset_index(name='Top 5'))
print (df)
customer Top 5
0 John Shoes Milk Bread
1 Mary Milk Shoes
2 Joe Bread Beer Shoes Milk Fruit
Upvotes: 2
Reputation: 323226
Chain the function together
s=df.groupby(['customer','product']).size().sort_values(ascending=Fasle).\
groupby(level=0).head(5).reset_index().\
groupby('customer').product.agg(','.join)
customer
Joe Bread,Beer,Shoes,Milk,Fruit
John Shoes,Milk,Bread
Mary Milk,Shoes
Name: product, dtype: object
Upvotes: 2
Reputation: 3898
Try this:
fn = lambda _: _.value_counts().index.to_list()
df.groupby('customer')['product'].apply(fn).reset_index()
If you want Top 5
as a string instead of a set:
outdf['Top 5'] = outdf['product'].str.join(', ')
Upvotes: 2
Reputation: 3910
from collections import Counter
def fn(s):
return ", ".join(k for k, v in Counter(s).most_common(5))
df.groupby("customer")["product"].apply(fn)
customer
Joe Bread, Beer, Milk, Shoes, Fruit
John Shoes, Milk, Bread
Mary Milk, Shoes
Name: product, dtype: object
Upvotes: 2