Compute the intersection of lists for each pair of values in a column

Question

I have a data set with two columns, user IDs and their interests, and I want to find users with common interests. How can I do that? For example, I would take the first user's interests and compare them with every other user's interests individually, then take the second user and compare their interests with every other user's, and so on.

My data looks like:

userid   interest
 1       [A, B]
 2       [A, C, B]
 3       [B, D]

I am not sure how to do this. My attempt so far:

for i in range(0,3):
  for j in range(i+1, 3):
    print((df['interest'].loc[i]).intersection(df['interest'].loc[j]))

My output should be:

userid    relativeid  common interest
  1          2           [A, B]
  1          3           [B]
  2          3           [B]

cs95 · Accepted Answer

Use a dictionary for lookup, mapping each "userid" to a set of its interests. You can then generate every pair of "userid" values with itertools.combinations and take the set intersection for each pair.

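import itertools

import pandas as pd

# Map each userid to a set of its interests
m = df.set_index('userid')['interest'].map(set).to_dict()
m
# {1: {'A', 'B'}, 2: {'A', 'B', 'C'}, 3: {'B', 'D'}}

# One row per unordered pair of userids
out = pd.DataFrame(
    itertools.combinations(df.userid, 2), columns=['userid', 'relativeid'])
# Intersect the two sets for each pair (element order within a set is arbitrary)
out['common_interest'] = [list(m[x] & m[y]) for x, y in out.values]
out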

   userid  relativeid common_interest
0       1           2          [B, A]
1       1           3             [B]
2       2           3             [B]
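For reference, here is a self-contained sketch of the same approach that can be run as-is. It assumes the "interest" column holds Python lists (the question does not show how df was built, so the construction below is an assumption), and it swaps list(...) for sorted(...) so that the order within each result is deterministic rather than depending on set iteration order.

```python
import itertools

import pandas as pd

# Sample data from the question; storing interests as Python lists is an assumption.
df = pd.DataFrame({
    "userid": [1, 2, 3],
    "interest": [["A", "B"], ["A", "C", "B"], ["B", "D"]],
})

# Map each userid to a set of interests so intersections are cheap.
interests = df.set_index("userid")["interest"].map(set).to_dict()

# One row per unordered pair of users; sorted() fixes the order of each result list.
rows = [
    (a, b, sorted(interests[a] & interests[b]))
    for a, b in itertools.combinations(df["userid"], 2)
]
out = pd.DataFrame(rows, columns=["userid", "relativeid", "common_interest"])
print(out)
```

Pairs with nothing in common would show up with an empty list in common_interest; whether to keep or filter those rows depends on what you need.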
