grasshopper
grasshopper

Reputation: 958

Pandas combine rows that share an association

I have a dataframe of user - item combinations.

    user    item
0   user1   item1
1   user1   item2
2   user1   item3
3   user2   item1
4   user2   item4
5   user3   item1
6   user3   item2
7   user3   item4

What I want to do is get an edge list of items that share the same user (simpler) or a co-occurrence matrix of how often two items share the same user (more complicated). To be more clear, the co-occurrence matrix would show how often two items are bought together.

Here is an example of the Edge list

    pair1   pair2
0   item1   item2
1   item2   item3
2   item3   item1
3   item1   item4
4   item1   item4
5   item1   item2
6   item2   item4

Co-occurrence matrix

         item1  item2   item3   item4
item1      5      2       1       2
item2      2      4       1       1
item3      1      1       2       0
item4      2      1       0       3

Upvotes: 1

Views: 797

Answers (2)

unutbu
unutbu

Reputation: 879611

We can generated the edge list using groupby/apply and itertools.combinations to generate all pairs for each group.

To generate the cooccurrence matrix, we can start by using pd.crosstab to compute a frequency table. Since this result is upper triangular and the desired matrix is symmetric, we can add its transpose to make it symmetric. The diagonals appear to be the sum of other items in each row. Filling in these values using pandas requires a for-loop. Alternatively, we can modify the underlying NumPy array and then rebuild the DataFrame from this modified array.


import itertools as IT
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'item': ['item1', 'item2', 'item3', 'item1', 'item4', 'item1', 'item2', 'item4'],
    'user': ['user1', 'user1', 'user1', 'user2', 'user2', 'user3', 'user3', 'user3']})
edges = df.groupby(['user'], group_keys=False).apply(
    lambda x: pd.DataFrame(list(IT.combinations(x['item'], 2)), 
                           columns=['first', 'second'])).reset_index(drop=True)
print(edges)

yields

   first second
0  item1  item2
1  item1  item3
2  item2  item3
3  item1  item4
4  item1  item2
5  item1  item4
6  item2  item4

cooccurrence = pd.crosstab(index=[edges['first']], columns=[edges['second']])
items = df['item'].unique()
cooccurrence = cooccurrence.reindex(index=items, columns=items)
cooccurrence = cooccurrence.add(cooccurrence.T, fill_value=0)
cooccurrence = cooccurrence.fillna(0)
diagvals = cooccurrence.sum(axis=0)
arr = cooccurrence.values
i = np.arange(len(diagvals))
arr[i,i] = diagvals
cooccurrence = pd.DataFrame(arr, columns=cooccurrence.columns,
                            index=cooccurrence.index)
print(cooccurrence)

yields

second  item1  item2  item3  item4
first                             
item1       5      2      1      2
item2       2      4      1      1
item3       1      1      2      0
item4       2      1      0      3

Upvotes: 2

small_data88
small_data88

Reputation: 380

I am not too familiar with an edge list, but it looks like all you are doing is chaining the item column of the raw data frame into rows of a new data frame. For the edge list you could try to iterate through the item column and write each element of the link into new rows of the new data frame. Maybe a nested for loop for iterating would work.

For the co-occurence matrix, you could use pandas.cut to bin the data and obtain a frequency distribution list of each items co-occurence. Then you populate an array with each frequency.

I haven't used pandas in a while, but I will see what I can do about a code solution.

Upvotes: 0

Related Questions