Claudio Davi
Claudio Davi

Reputation: 154

How do I find the most common combinations of items on a grouped dataframe?

I have a dataframe like this:

pd.DataFrame([{"order_id": 1234, "product": "milk"},
              {"order_id": 1234, "product": "butter"},
             {"order_id": 4321, "product": "bread"}, 
             {"order_id": 4321, "product": "milk"},
             {"order_id": 4321, "product": "butter"},
             {"order_id": 1111, "product": "corn"},
             {"order_id": 1111, "product": "cereal"},
             {"order_id": 8888, "product": "milk"}])

    order_id    product
0   1234    milk
1   1234    butter
2   4321    bread
3   4321    milk
4   4321    butter
5   1111    corn
6   1111    cereal
7   8888    milk

I need to find the most common combination of products, without having to infer how many products to put in these combinations.

This example shoud return milk and butter as the two most bought together itens.

I have tried grouping them by the order_id but could not find a solution to get the combinations inside the groups.

Upvotes: 2

Views: 1042

Answers (2)

piRSquared
piRSquared

Reputation: 294218

itertools.combinations and pandas.Series.mode

from itertools import combinations

pd.Series.mode([
    t for _, d in df.groupby('order_id').product
    for t in combinations(d, 2)
])

0    (milk, butter)
dtype: object

collections.Counter

Similar answer as above but using Counter instead of pandas.Series.mode

from itertools import combinations
from collections import Counter

Counter([
    t for _, d in df.groupby('order_id').product
    for t in combinations(d, 2)
]).most_common(1)

[(('milk', 'butter'), 2)]

Upvotes: 1

Quang Hoang
Quang Hoang

Reputation: 150735

We can do find pairs of products by merge and groupby.size:

# merge on id to pair up the products
new_df = df.merge(df, on='order_id')

# first thing is to drop identical products
(new_df[new_df['product_x'].lt(new_df['product_y'])]
    .groupby(['order_id', 'product_x', 'product_y'])              # group
    .size()            # count (id, prod1, prod2)
    .sum(level=[1,2])  # sum over (prod1, prod2)
    .idxmax()          # get (prod1, prod2) with max count
)

gives you

('butter', 'milk')

Upvotes: 3

Related Questions