Pandas combine rows that share an association

Question

I have a dataframe of user - item combinations.

    user    item
0   user1   item1
1   user1   item2
2   user1   item3
3   user2   item1
4   user2   item4
5   user3   item1
6   user3   item2
7   user3   item4

What I want to do is get an edge list of items that share the same user (simpler) or a co-occurrence matrix of how often two items share the same user (more complicated). To be more clear, the co-occurrence matrix would show how often two items are bought together.

Here is an example of the edge list:

    pair1   pair2
0   item1   item2
1   item2   item3
2   item3   item1
3   item1   item4
4   item1   item4
5   item1   item2
6   item2   item4

Co-occurrence matrix

         item1  item2   item3   item4
item1      5      2       1       2
item2      2      4       1       1
item3      1      1       2       0
item4      2      1       0       3

unutbu · Accepted Answer

We can generate the edge list using groupby/apply, with itertools.combinations producing all pairs of items for each user.

To generate the co-occurrence matrix, we can start by using pd.crosstab to compute a frequency table of the edges. Since this result is upper triangular and the desired matrix is symmetric, we can add its transpose to make it symmetric. In the desired output, each diagonal entry appears to be the sum of the other entries in its row. Filling in these values using pandas requires a for-loop. Alternatively, we can modify the underlying NumPy array and then rebuild the DataFrame from this modified array.

import itertools as IT
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'item': ['item1', 'item2', 'item3', 'item1', 'item4', 'item1', 'item2', 'item4'],
    'user': ['user1', 'user1', 'user1', 'user2', 'user2', 'user3', 'user3', 'user3']})
edges = df.groupby(['user'], group_keys=False).apply(
    lambda x: pd.DataFrame(list(IT.combinations(x['item'], 2)), 
                           columns=['first', 'second'])).reset_index(drop=True)
print(edges)

yields

   first second
0  item1  item2
1  item1  item3
2  item2  item3
3  item1  item4
4  item1  item2
5  item1  item4
6  item2  item4
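
As a possible alternative to groupby/apply, the same pairs can be sketched with a self-merge on user. This is a different technique from the one above, not part of the original answer. It assumes each user lists an item at most once and that ordering the pair by string comparison is acceptable; the names pairs and merged are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['item1', 'item2', 'item3', 'item1', 'item4', 'item1', 'item2', 'item4'],
    'user': ['user1', 'user1', 'user1', 'user2', 'user2', 'user3', 'user3', 'user3']})

# Pair every item of a user with every other item of the same user.
merged = df.merge(df, on='user', suffixes=('_first', '_second'))

# Keep one ordering of each pair and drop self-pairs
# (assumption: plain string comparison is a fine tie-breaker).
pairs = merged.loc[merged['item_first'] < merged['item_second'],
                   ['item_first', 'item_second']]
pairs = pairs.rename(columns={'item_first': 'first', 'item_second': 'second'})
pairs = pairs.reset_index(drop=True)
print(pairs)
```

The merge builds every ordered pair per user before filtering, so it uses more memory than itertools.combinations when users have many items.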

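# Count how often each (first, second) pair occurs in the edge list.
cooccurrence = pd.crosstab(index=[edges['first']], columns=[edges['second']])
items = df['item'].unique()
# Reindex so rows and columns cover every item in the same order.
cooccurrence = cooccurrence.reindex(index=items, columns=items)
# The table is upper triangular; adding its transpose makes it symmetric.
cooccurrence = cooccurrence.add(cooccurrence.T, fill_value=0)
# Reindexing introduced NaNs (and floats); restore integer counts.
cooccurrence = cooccurrence.fillna(0).astype(int)
# The diagonal is still zero, so each column sum is the sum of the other entries.
diagvals = cooccurrence.sum(axis=0)
# Work on a writable copy; .values may be a read-only view in newer pandas.
arr = cooccurrence.to_numpy(copy=True)
i = np.arange(len(diagvals))
arr[i, i] = diagvals
cooccurrence = pd.DataFrame(arr, columns=cooccurrence.columns,
                            index=cooccurrence.index)
print(cooccurrence)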

yields

second  item1  item2  item3  item4
first                             
item1       5      2      1      2
item2       2      4      1      1
item3       1      1      2      0
item4       2      1      0      3
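
The matrix above can also be reached without the edge list. The sketch below is a swapped-in technique rather than the method of this answer: it builds a user-by-item incidence table and multiplies it by its own transpose. It assumes each (user, item) row appears at most once, and it follows the question's convention that the diagonal holds the sum of the other entries in the row. The name incidence is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'item': ['item1', 'item2', 'item3', 'item1', 'item4', 'item1', 'item2', 'item4'],
    'user': ['user1', 'user1', 'user1', 'user2', 'user2', 'user3', 'user3', 'user3']})

# Rows are users, columns are items; 1 where the user has the item
# (assumption: no duplicate user/item rows, so counts never exceed 1).
incidence = pd.crosstab(df['user'], df['item'])
arr = incidence.to_numpy()

# Entry (a, b) of arr.T @ arr counts the users who have both item a and item b.
co = arr.T @ arr

# The raw diagonal counts users per item; replace it with the sum of the
# off-diagonal entries in each row, as in the question's example.
np.fill_diagonal(co, 0)
np.fill_diagonal(co, co.sum(axis=1))

cooccurrence = pd.DataFrame(co, index=incidence.columns, columns=incidence.columns)
print(cooccurrence)
```

For many users and items, a sparse incidence matrix would likely be a better fit than the dense array used here.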
