Find Euclidean / cosine distance between a tensor and all tensors stored in a column of a dataframe efficiently

Question

I have a tensor 'input_sentence_embed' with shape torch.Size([1, 768])

There is a dataframe 'matched_df' which looks like

   INCIDENT_NUMBER           enc_rep           
0  INC000030884498      [[tensor(-0.2556), tensor(0.0188), tensor(0.02...
1  INC000029956111      [[tensor(-0.3115), tensor(0.2535), tensor(0.20..
2  INC000029555353      [[tensor(-0.3082), tensor(0.2814), tensor(0.24...
3  INC000029555338      [[tensor(-0.2759), tensor(0.2604), tensor(0.21...

The shape of each tensor element in the dataframe is

 matched_df['enc_rep'].iloc[0].size()
 torch.Size([1, 768])

I want to find the Euclidean distance / cosine similarity between 'input_sentence_embed' and each row of 'matched_df' efficiently.

If they were scalar values, I could easily have broadcast 'input_sentence_embed' as a new column in 'matched_df' and then computed the cosine similarity between the two columns.

I am struggling with two problems:

How to broadcast 'input_sentence_embed' as a new column to 'matched_df'
How to find the cosine similarity between tensors stored in two columns

Maybe someone can also suggest an easier way to reach the end goal: efficiently finding the similarity between one tensor and all tensors stored in a dataframe column.

Corralien · Accepted Answer

Input data:

import pandas as pd
import numpy as np
from torch import tensor

match_df = pd.DataFrame({'INCIDENT_NUMBER': ['INC000030884498',
  'INC000029956111',
  'INC000029555353',
  'INC000029555338'],
 'enc_rep': [[[tensor(0.2971), tensor(0.4831), tensor(0.8239), tensor(0.2048)]],
  [[tensor(0.3481), tensor(0.8104) , tensor(0.2879), tensor(0.9747)]],
  [[tensor(0.2210), tensor(0.3478), tensor(0.2619), tensor(0.2429)]],
  [[tensor(0.2951), tensor(0.6698), tensor(0.9654), tensor(0.5733)]]]})

input_sentence_embed = [[tensor(0.0590), tensor(0.3919), tensor(0.7821) , tensor(0.1967)]]

How to broadcast 'input_sentence_embed' as a new column to the 'matched_df'

match_df["input_sentence_embed"] = [input_sentence_embed] * len(match_df)
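The extra column is handy for display, but the computation itself does not need a per-row copy of the query: NumPy broadcasting can compare every row with the query directly. A minimal sketch for the Euclidean distance mentioned in the title, assuming the embeddings have already been converted to plain float arrays (the names `a`, `b` and `euclidean_distance` are only illustrative):

```python
import numpy as np

# Plain floats standing in for the tensor values of the example data.
a = np.array([[0.2971, 0.4831, 0.8239, 0.2048],
              [0.3481, 0.8104, 0.2879, 0.9747],
              [0.2210, 0.3478, 0.2619, 0.2429],
              [0.2951, 0.6698, 0.9654, 0.5733]])  # shape (4, 4): one row per incident
b = np.array([0.0590, 0.3919, 0.7821, 0.1967])    # shape (4,): the query embedding

# a - b broadcasts b across every row of a, so no repeated column is required.
# axis=1 takes the norm of each row difference separately.
euclidean_distance = np.linalg.norm(a - b, axis=1)  # shape (4,)
```

The result has one value per row, so it can be assigned straight to a new dataframe column; smaller values mean closer embeddings.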

How to find cosine similarity between tensors stored in two columns

a = np.vstack(match_df["enc_rep"])   # shape (4, 4): one row per stored embedding
b = np.hstack(input_sentence_embed)  # shape (4,): the query embedding
# Divide by each row's own norm (axis=1), not by the norm of the whole matrix
match_df["cosine_similarity"] = a.dot(b) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b))

Output result:

   INCIDENT_NUMBER                                            enc_rep                               input_sentence_embed  cosine_similarity
0  INC000030884498  [[tensor(0.2971), tensor(0.4831), tensor(0.823...  [[tensor(0.0590), tensor(0.3919), tensor(0.782...           0.971748
1  INC000029956111  [[tensor(0.3481), tensor(0.8104), tensor(0.287...  [[tensor(0.0590), tensor(0.3919), tensor(0.782...           0.624403
2  INC000029555353  [[tensor(0.2210), tensor(0.3478), tensor(0.261...  [[tensor(0.0590), tensor(0.3919), tensor(0.782...           0.820259
3  INC000029555338  [[tensor(0.2951), tensor(0.6698), tensor(0.965...  [[tensor(0.0590), tensor(0.3919), tensor(0.782...           0.952969
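For the data described in the question, where each cell holds a tensor of shape [1, 768], it may be simpler to stay in PyTorch and stack the column into a single matrix. The following is a sketch under that assumption, not a tested drop-in: `stored` and `query` are hypothetical stand-ins for `matched_df['enc_rep'].tolist()` and `input_sentence_embed`, filled with random values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the real data: four stored embeddings and one
# query, each of shape [1, 768] as reported in the question.
stored = [torch.randn(1, 768) for _ in range(4)]  # e.g. matched_df['enc_rep'].tolist()
query = torch.randn(1, 768)                       # e.g. input_sentence_embed

# Concatenate the per-row tensors into one [n, 768] matrix.
matrix = torch.cat(stored, dim=0)

# Cosine similarity of the query against every row; the query broadcasts over rows.
cosine = F.cosine_similarity(matrix, query, dim=1)  # shape [n]

# Euclidean distance between the query and every row.
euclidean = torch.cdist(query, matrix).squeeze(0)   # shape [n]
```

Both results have one entry per dataframe row, so something like `cosine.tolist()` could be assigned back as a new column.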
