Syed Md Ismail

Reputation: 898

Find Euclidean / cosine distance between a tensor and all tensors stored in a dataframe column efficiently

I have a tensor 'input_sentence_embed' with shape torch.Size([1, 768])

There is a dataframe 'matched_df' which looks like

   INCIDENT_NUMBER           enc_rep           
0  INC000030884498      [[tensor(-0.2556), tensor(0.0188), tensor(0.02...
1  INC000029956111      [[tensor(-0.3115), tensor(0.2535), tensor(0.20..
2  INC000029555353      [[tensor(-0.3082), tensor(0.2814), tensor(0.24...
3  INC000029555338      [[tensor(-0.2759), tensor(0.2604), tensor(0.21...

Each tensor element in the dataframe has this shape:

 matched_df['enc_rep'].iloc[0].size()
 torch.Size([1, 768])

I want to find the Euclidean distance / cosine similarity between 'input_sentence_embed' and each row of 'matched_df' efficiently.

If they were scalar values, I could easily have broadcast 'input_sentence_embed' as a new column in 'matched_df' and then computed the cosine similarity between the two columns.

I am struggling with two problems:

  1. How to broadcast 'input_sentence_embed' as a new column to the 'matched_df'
  2. How to find cosine similarity between tensors stored in two columns

Maybe someone can also suggest an easier way to reach the end goal: efficiently finding the similarity between one tensor and all tensors stored in a dataframe column.

Upvotes: 0

Views: 1123

Answers (2)

Corralien

Reputation: 120509

Input data:

import pandas as pd
import numpy as np
from torch import tensor

match_df = pd.DataFrame({'INCIDENT_NUMBER': ['INC000030884498',
  'INC000029956111',
  'INC000029555353',
  'INC000029555338'],
 'enc_rep': [[[tensor(0.2971), tensor(0.4831), tensor(0.8239), tensor(0.2048)]],
  [[tensor(0.3481), tensor(0.8104) , tensor(0.2879), tensor(0.9747)]],
  [[tensor(0.2210), tensor(0.3478), tensor(0.2619), tensor(0.2429)]],
  [[tensor(0.2951), tensor(0.6698), tensor(0.9654), tensor(0.5733)]]]})

input_sentence_embed = [[tensor(0.0590), tensor(0.3919), tensor(0.7821) , tensor(0.1967)]]

  1. How to broadcast 'input_sentence_embed' as a new column to the 'matched_df'

match_df["input_sentence_embed"] = [input_sentence_embed] * len(match_df)

  2. How to find cosine similarity between tensors stored in two columns

a = np.vstack(match_df["enc_rep"])
b = np.hstack(input_sentence_embed)
# Take the norm of each row of `a` separately (axis=1). Without the axis argument,
# np.linalg.norm(a) returns one norm for the whole matrix, which is not a cosine.
match_df["cosine_similarity"] = a.dot(b) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b))

Output result:

   INCIDENT_NUMBER                                            enc_rep                               input_sentence_embed  cosine_similarity
0  INC000030884498  [[tensor(0.2971), tensor(0.4831), tensor(0.823...  [[tensor(0.0590), tensor(0.3919), tensor(0.782...           0.971748
1  INC000029956111  [[tensor(0.3481), tensor(0.8104), tensor(0.287...  [[tensor(0.0590), tensor(0.3919), tensor(0.782...           0.624403
2  INC000029555353  [[tensor(0.2210), tensor(0.3478), tensor(0.261...  [[tensor(0.0590), tensor(0.3919), tensor(0.782...           0.820259
3  INC000029555338  [[tensor(0.2951), tensor(0.6698), tensor(0.965...  [[tensor(0.0590), tensor(0.3919), tensor(0.782...           0.952969
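
Since the column already holds PyTorch tensors, the whole computation can probably stay in PyTorch. The following is a minimal sketch, assuming each 'enc_rep' entry is a real tensor of shape [1, D] as described in the question (D = 4 here instead of 768, with the toy values from above). The column names 'cosine_similarity' and 'euclidean_distance' are my own choice, not something from the question.

```python
import pandas as pd
import torch
import torch.nn.functional as F

# Toy stand-in for the real data: four rows, each a tensor of shape [1, 4]
# (the question uses [1, 768]).
matched_df = pd.DataFrame({
    "INCIDENT_NUMBER": ["INC000030884498", "INC000029956111",
                        "INC000029555353", "INC000029555338"],
    "enc_rep": [
        torch.tensor([[0.2971, 0.4831, 0.8239, 0.2048]]),
        torch.tensor([[0.3481, 0.8104, 0.2879, 0.9747]]),
        torch.tensor([[0.2210, 0.3478, 0.2619, 0.2429]]),
        torch.tensor([[0.2951, 0.6698, 0.9654, 0.5733]]),
    ],
})
input_sentence_embed = torch.tensor([[0.0590, 0.3919, 0.7821, 0.1967]])

# Stack the column once into a single [N, D] matrix.
enc = torch.cat(matched_df["enc_rep"].tolist(), dim=0)

# Cosine similarity: the [1, D] query broadcasts against the [N, D] matrix.
cos = F.cosine_similarity(enc, input_sentence_embed, dim=1)

# Euclidean distance: cdist returns shape [N, 1]; squeeze it to [N].
dist = torch.cdist(enc, input_sentence_embed, p=2).squeeze(1)

matched_df["cosine_similarity"] = cos.numpy()
matched_df["euclidean_distance"] = dist.numpy()
```

With this layout there is no need to copy the query into a new column: one stacked matrix and one broadcast call cover all rows.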

Upvotes: 1

mon

Reputation: 22356

Basics

I suppose you are trying to calculate the similarity or closeness of two vectors via:

  • euclidean distance between vectors or
  • cosine between vectors

Cosine similarity

For cosine similarity, you need:

  1. Norm of each vector -> You can use np.linalg.norm
  2. Cosine of vectors -> You can use the dot product of the normalized vectors (np.inner or np.dot)

https://en.wikipedia.org/wiki/Cosine_similarity

For instance A = [0.8, 0.9] and B = [1.0, 0.0], then the cosine similarity of A and B is:

import numpy as np

A = np.array([0.8, 0.9])
B = np.array([1.0, 0.0])

EA = np.linalg.norm(A)
EB = np.linalg.norm(B)
NA = A / EA
NB = B / EB

COS_A_B = np.dot(NA, NB)
COS_A_B
---
0.6643638388299198

So if we can get two vectors (rows) A and B from the enc_rep column, then we can calculate the cosine between them.
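
The Euclidean distance from the list above can be sketched the same way. This is a minimal example with the same A and B; the name DIST_A_B is just my label. For unit-length vectors the two measures are tied together by ||NA - NB||^2 = 2 - 2 cos(A, B), which the last line checks numerically.

```python
import numpy as np

A = np.array([0.8, 0.9])
B = np.array([1.0, 0.0])

# Euclidean distance is the norm of the difference vector.
DIST_A_B = np.linalg.norm(A - B)

# For normalized vectors, squared distance and cosine carry the same information.
NA = A / np.linalg.norm(A)
NB = B / np.linalg.norm(B)
relation_holds = np.isclose(np.linalg.norm(NA - NB) ** 2, 2 - 2 * np.dot(NA, NB))
```

Here A - B is [-0.2, 0.9], so the distance is sqrt(0.85), roughly 0.922.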

Pandas

We need to figure out how to run those cosine calculations on the same column.

import pandas as pd

C = np.array([0.5, 0.3])

df = pd.DataFrame(columns=['ID','enc_rep'])
df.loc[0] = [1, A]
df.loc[1] = [2, B]
df.loc[2] = [3, C]
df
---
    ID  enc_rep
0   1   [0.8, 0.9]
1   2   [1.0, 0.0]
2   3   [0.5, 0.3]

One naive way is to create a cartesian product of the enc_rep column with itself.

cartesian_df = df['enc_rep'].to_frame().merge(df['enc_rep'], how='cross')
cartesian_df
---
    enc_rep_x   enc_rep_y
0   [0.8, 0.9]  [0.8, 0.9]
1   [0.8, 0.9]  [1.0, 0.0]
2   [0.8, 0.9]  [0.5, 0.3]
3   [1.0, 0.0]  [0.8, 0.9]
4   [1.0, 0.0]  [1.0, 0.0]
5   [1.0, 0.0]  [0.5, 0.3]
6   [0.5, 0.3]  [0.8, 0.9]
7   [0.5, 0.3]  [1.0, 0.0]
8   [0.5, 0.3]  [0.5, 0.3]

Take the cosine between enc_rep_x and enc_rep_y.

def f(x, y):
    nx = x / np.linalg.norm(x)
    ny = y / np.linalg.norm(y)
    return np.dot(nx, ny)

cartesian_df['cosine'] = cartesian_df.apply(lambda row: f(row.enc_rep_x, row.enc_rep_y), axis=1)
cartesian_df
---
    enc_rep_x   enc_rep_y   cosine
0   [0.8, 0.9]  [0.8, 0.9]  1.000000
1   [0.8, 0.9]  [1.0, 0.0]  0.664364
2   [0.8, 0.9]  [0.5, 0.3]  0.954226
3   [1.0, 0.0]  [0.8, 0.9]  0.664364
4   [1.0, 0.0]  [1.0, 0.0]  1.000000
5   [1.0, 0.0]  [0.5, 0.3]  0.857493
6   [0.5, 0.3]  [0.8, 0.9]  0.954226
7   [0.5, 0.3]  [1.0, 0.0]  0.857493
8   [0.5, 0.3]  [0.5, 0.3]  1.000000

However, if the number of rows is large, this creates a huge dataframe with duplicates: N rows become N² pairs, and every unordered pair appears twice. If the size is not an issue, then you can drop one column and take unique rows.
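
One way to avoid the cross-joined dataframe is to stack the vectors into a matrix and let a single matrix product produce every pairwise cosine. This is only a sketch under the assumption that all vectors have the same length and fit in memory; the names X, X_unit, cos_matrix and pairs are mine.

```python
import numpy as np

# Same three toy vectors as above, stacked into one (N, D) matrix.
X = np.array([[0.8, 0.9],
              [1.0, 0.0],
              [0.5, 0.3]])

# Normalize every row to unit length, then one matrix product gives all pairs.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
cos_matrix = X_unit @ X_unit.T  # shape (N, N), symmetric, ones on the diagonal

# Keep each unordered pair once: indices strictly above the diagonal.
i, j = np.triu_indices(len(X), k=1)
pairs = list(zip(i.tolist(), j.tolist(), cos_matrix[i, j].tolist()))
```

The matrix still has N² entries, but they are plain floats rather than rows of Python objects, and the upper-triangle indexing removes the duplicates and self-pairs.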

Hope this gives an idea of how to proceed. Details such as whether each vector is 2-dimensional (shape [1, 768]) or 1-dimensional are left for you to work out.
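
As a starting point for the shape question, here is a small sketch, assuming each stored vector has a leading axis of length 1 as in the question. The arrays rows and query are hypothetical stand-ins built from the toy values above.

```python
import numpy as np

# Hypothetical stand-ins: each stored vector has shape (1, D), as in the question.
rows = [np.array([[0.8, 0.9]]), np.array([[1.0, 0.0]]), np.array([[0.5, 0.3]])]
query = np.array([[0.5, 0.3]])

M = np.vstack(rows)  # (N, D): the leading axis of each entry becomes the row axis
q = query.ravel()    # (D,): a flat vector, so M @ q has shape (N,)

# Row-wise cosine between every stored vector and the query.
cos = (M @ q) / (np.linalg.norm(M, axis=1) * np.linalg.norm(q))
```

Stacking turns the N separate (1, D) entries into one (N, D) matrix, and flattening the query keeps the result one-dimensional with one value per row.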

Upvotes: 1
