Summarize the data to count occurence of combinations

Question

I have data for puchases like below

User        Product
u1          p1
u1          p2
u2          p1
u3          p5
...         ...
...         ...
...         ...

where users is large (order of 10k) and p's are small.

I want to compute a matrix type structure indicating how many people purchased p1, p1 and p2, p1 and p3 etc like below

   p1 p2 p3 p4 p5 Total
p1 1  1           2
p2 1              1
p3
p4
p5             1  1

I really could not find a way to achieve this..What I have attempted is this..

pd.pivot_table(df,index='Product',columns='Product').fillna('')

        Num
Product p1  p2  p5
Product         
p1      1.0 
p2          1.0 
p5              1.0

what is the standard way to do this?

Arne · Accepted Answer

There is probably a more clever way to do this, but here is a straightforward approach:

import numpy as np
import pandas as pd

df = pd.DataFrame([['u1', 'p1'],
                   ['u1', 'p2'],
                   ['u2', 'p1'],
                   ['u3', 'p5']],
                  columns=['User', 'Product'])

products = sorted(list(set(df.Product)))
n_products = len(products)
product_pairs = np.zeros(shape=(n_products, n_products),
                         dtype=int)

for cart in df.groupby('User')['Product'].apply(set):
    for row, p1 in enumerate(products):
        if p1 in cart:
            for col, p2 in enumerate(products):
                if p2 in cart:
                    product_pairs[row, col] += 1

pd.DataFrame(product_pairs, index=products, 
             columns=products)

    p1  p2  p5
p1  2   1   0
p2  1   1   0
p5  0   0   1

Summarize the data to count occurence of combinations

Answers (1)

Related Questions