Qubix
Qubix

Reputation: 4353

Compute element overlap based on another column, pandas

If I have a dataframe of the form:

 tag      element_id
  1          12
  1          13
  1          15
  2          12
  2          13
  2          19
  3          12
  3          15
  3          22

how can I compute the overlaps of the tags in terms of the element_id ? The result I guess should be an overlap matrix of the form:

   1   2   3
1  X   2   2
2  2   X   1
3  2   1   X

where I put X on the diagonal since the overlap of a tag with itself is not relevant and where the numbers in the matrix represent the total element_ids that the two tags share.

My attempts:

You can try and use a for loop like :

for item in df.itertuples():
    element_lst += [item.element_id]
    element_tag = item.tag
# then intersect the element_list row by row. 
# This is extremely costly for large datasets

The second thing I was thinking about was to use df.groupby('tag') and try to somehow intersect on element_id, but it is not clear to me how I can do that with grouped data.

Upvotes: 2

Views: 185

Answers (1)

ALollz
ALollz

Reputation: 59519

merge + crosstab

# Find element overlap, remove same tag matches
res = df.merge(df, on='element_id').query('tag_x != tag_y')

pd.crosstab(res.tag_x, res.tag_y)

Output:

tag_y  1  2  3
tag_x         
1      0  2  2
2      2  0  1
3      2  1  0

Upvotes: 3

Related Questions