Reputation: 154
I have a dataframe like this:
pd.DataFrame([{"order_id": 1234, "product": "milk"},
{"order_id": 1234, "product": "butter"},
{"order_id": 4321, "product": "bread"},
{"order_id": 4321, "product": "milk"},
{"order_id": 4321, "product": "butter"},
{"order_id": 1111, "product": "corn"},
{"order_id": 1111, "product": "cereal"},
{"order_id": 8888, "product": "milk"}])
order_id product
0 1234 milk
1 1234 butter
2 4321 bread
3 4321 milk
4 4321 butter
5 1111 corn
6 1111 cereal
7 8888 milk
I need to find the most common combination of products, without having to infer how many products to put in these combinations.
This example shoud return milk and butter as the two most bought together itens.
I have tried grouping them by the order_id but could not find a solution to get the combinations inside the groups.
Upvotes: 2
Views: 1042
Reputation: 294218
itertools.combinations
and pandas.Series.mode
from itertools import combinations
pd.Series.mode([
t for _, d in df.groupby('order_id').product
for t in combinations(d, 2)
])
0 (milk, butter)
dtype: object
collections.Counter
Similar answer as above but using Counter
instead of pandas.Series.mode
from itertools import combinations
from collections import Counter
Counter([
t for _, d in df.groupby('order_id').product
for t in combinations(d, 2)
]).most_common(1)
[(('milk', 'butter'), 2)]
Upvotes: 1
Reputation: 150735
We can do find pairs of products by merge
and groupby.size
:
# merge on id to pair up the products
new_df = df.merge(df, on='order_id')
# first thing is to drop identical products
(new_df[new_df['product_x'].lt(new_df['product_y'])]
.groupby(['order_id', 'product_x', 'product_y']) # group
.size() # count (id, prod1, prod2)
.sum(level=[1,2]) # sum over (prod1, prod2)
.idxmax() # get (prod1, prod2) with max count
)
gives you
('butter', 'milk')
Upvotes: 3