Reputation: 958
I have a dataframe of user - item combinations.
user item
0 user1 item1
1 user1 item2
2 user1 item3
3 user2 item1
4 user2 item4
5 user3 item1
6 user3 item2
7 user3 item4
What I want to do is get an edge list of items that share the same user (simpler) or a co-occurrence matrix of how often two items share the same user (more complicated). To be more clear, the co-occurrence matrix would show how often two items are bought together.
Here is an example of the Edge list
pair1 pair2
0 item1 item2
1 item2 item3
2 item3 item1
3 item1 item4
4 item1 item4
5 item1 item2
6 item2 item4
Co-occurrence matrix
item1 item2 item3 item4
item1 5 2 1 2
item2 2 4 1 1
item3 1 1 2 0
item4 2 1 0 3
Upvotes: 1
Views: 797
Reputation: 879611
We can generated the edge list using groupby/apply
and
itertools.combinations
to generate all pairs for each group.
To generate the cooccurrence matrix, we can start by using pd.crosstab
to
compute a frequency table. Since this result is upper triangular and the desired
matrix is symmetric, we can add its transpose to make it symmetric. The
diagonals appear to be the sum of other items in each row. Filling in these
values using pandas requires a for-loop. Alternatively, we can modify the
underlying NumPy array and then rebuild the DataFrame from this modified array.
import itertools as IT
import numpy as np
import pandas as pd
df = pd.DataFrame({
'item': ['item1', 'item2', 'item3', 'item1', 'item4', 'item1', 'item2', 'item4'],
'user': ['user1', 'user1', 'user1', 'user2', 'user2', 'user3', 'user3', 'user3']})
edges = df.groupby(['user'], group_keys=False).apply(
lambda x: pd.DataFrame(list(IT.combinations(x['item'], 2)),
columns=['first', 'second'])).reset_index(drop=True)
print(edges)
yields
first second
0 item1 item2
1 item1 item3
2 item2 item3
3 item1 item4
4 item1 item2
5 item1 item4
6 item2 item4
cooccurrence = pd.crosstab(index=[edges['first']], columns=[edges['second']])
items = df['item'].unique()
cooccurrence = cooccurrence.reindex(index=items, columns=items)
cooccurrence = cooccurrence.add(cooccurrence.T, fill_value=0)
cooccurrence = cooccurrence.fillna(0)
diagvals = cooccurrence.sum(axis=0)
arr = cooccurrence.values
i = np.arange(len(diagvals))
arr[i,i] = diagvals
cooccurrence = pd.DataFrame(arr, columns=cooccurrence.columns,
index=cooccurrence.index)
print(cooccurrence)
yields
second item1 item2 item3 item4
first
item1 5 2 1 2
item2 2 4 1 1
item3 1 1 2 0
item4 2 1 0 3
Upvotes: 2
Reputation: 380
I am not too familiar with an edge list, but it looks like all you are doing is chaining the item column of the raw data frame into rows of a new data frame. For the edge list you could try to iterate through the item column and write each element of the link into new rows of the new data frame. Maybe a nested for loop for iterating would work.
For the co-occurence matrix, you could use pandas.cut to bin the data and obtain a frequency distribution list of each items co-occurence. Then you populate an array with each frequency.
I haven't used pandas in a while, but I will see what I can do about a code solution.
Upvotes: 0