SCool
SCool

Reputation: 3375

Add index of value_counts() to new column for each group in groupby

There have been some variations of this question already but I can't find the exact thing I'm looking for.

I have a customer dataframe with the products they've purchased:

   customer product
0      John    Milk
1      John    Milk
2      John  Shoes 
3      John  Shoes 
4      John  Shoes 
5      John   Bread
6      Mary    Milk
7      Mary    Milk
8      Mary    Milk
9      Mary    Milk
10     Mary    Milk
11     Mary    Milk
12     Mary  Shoes 
13     Mary  Shoes 
14      Joe   Bread
15      Joe   Bread
16      Joe   Bread
17      Joe   Bread
18      Joe    Milk
19      Joe    Milk
20      Joe    Milk
21      Joe   Fruit
22      Joe   Fruit
23      Joe  Shoes 
24      Joe  Shoes 
25      Joe  Shoes 
26      Joe    Beer
27      Joe    Beer
28      Joe    Beer
29      Joe    Beer

Note: In the full dataframe the customers have hundreds of products so it can't just be a simple join / set of products in the column.

I would like get the top 5 products ( value_counts() ) for each customer and have the product names in a separate column. Not the count of the products, just the product names (the index of the value couts), and in the correct order of the value counts.

My goal is to have a dataframe like this:

  customer                        Top 5
0     John             Shoes Milk Bread
1     Mary                   Milk Shoes
2      Joe  Bread Beer Milk Shoes Fruit

I think I am on the right track with this line, I can see each customers Top 5 products:

newdf.groupby('customer')['product'].value_counts()

customer  product
Joe       Beer       4
          Bread      4
          Milk       3
          Shoes      3
          Fruit      2
John      Shoes      3
          Milk       2
          Bread      1
Mary      Milk       6
          Shoes      2

I am just not able to extract the information from this dataframe in the format I like. I have tried resetting the index and slicing etc. but I can't seem to get it right.

Upvotes: 2

Views: 930

Answers (4)

jezrael
jezrael

Reputation: 862611

Use Series.value_counts in lambda function for join top 5 values of index with GroupBy.agg:

f = lambda x: ' '.join(x.value_counts().index[:5])
df = (newdf.groupby('customer')['product'].agg(f)
           .reset_index(name='Top 5'))
print (df)  
 customer                         Top 5
0      Joe  Bread Beer Shoes Milk Fruit
1     John             Shoes Milk Bread
2     Mary                   Milk Shoes

If ordering of customer is important:

f = lambda x: ' '.join(x.value_counts().index[:5])
df = (newdf.groupby('customer', sort=False)['product'].agg(f)
           .reset_index(name='Top 5'))
print (df)  
  customer                        Top 5
0     John             Shoes Milk Bread
1     Mary                   Milk Shoes
2      Joe  Bread Beer Shoes Milk Fruit

Upvotes: 2

BENY
BENY

Reputation: 323226

Chain the function together

s=df.groupby(['customer','product']).size().sort_values(ascending=Fasle).\
     groupby(level=0).head(5).reset_index().\
     groupby('customer').product.agg(','.join)
customer
Joe     Bread,Beer,Shoes,Milk,Fruit
John               Shoes,Milk,Bread
Mary                     Milk,Shoes
Name: product, dtype: object

Upvotes: 2

Ian
Ian

Reputation: 3898

Try this:

fn = lambda _: _.value_counts().index.to_list()
df.groupby('customer')['product'].apply(fn).reset_index()

If you want Top 5 as a string instead of a set:

outdf['Top 5'] = outdf['product'].str.join(', ')

Upvotes: 2

Hryhorii Pavlenko
Hryhorii Pavlenko

Reputation: 3910

from collections import Counter


def fn(s):
    return ", ".join(k for k, v in Counter(s).most_common(5))

df.groupby("customer")["product"].apply(fn)
customer
Joe     Bread, Beer, Milk, Shoes, Fruit
John                 Shoes, Milk, Bread
Mary                        Milk, Shoes
Name: product, dtype: object

Upvotes: 2

Related Questions