Reputation: 898
I have a tensor 'input_sentence_embed' with shape torch.Size([1, 768])
There is a dataframe 'matched_df' which looks like
INCIDENT_NUMBER enc_rep
0 INC000030884498 [[tensor(-0.2556), tensor(0.0188), tensor(0.02...
1 INC000029956111 [[tensor(-0.3115), tensor(0.2535), tensor(0.20..
2 INC000029555353 [[tensor(-0.3082), tensor(0.2814), tensor(0.24...
3 INC000029555338 [[tensor(-0.2759), tensor(0.2604), tensor(0.21...
The shape of each tensor element in the dataframe looks like:
matched_df['enc_rep'].iloc[0].size()
torch.Size([1, 768])
I want to find the euclidean / cosine similarity between 'input_sentence_embed' and each row of 'matched_df' efficiently.
If they were scalar values, I could easily have broadcast 'input_sentence_embed' as a new column in 'matched_df' and then computed the cosine similarity between the two columns.
I am struggling with two problems: how to broadcast the tensor into a new column of 'matched_df', and how to compute the similarity between two columns of tensors.
Maybe someone can also suggest other, easier methods to achieve the end goal of finding the similarity between a tensor value and all tensors stored in a column of a dataframe efficiently.
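To make the end goal concrete, here is a rough sketch of what I am after (assuming every 'enc_rep' cell and 'input_sentence_embed' are torch tensors of shape [1, 768]); I am not sure this is the right or the most efficient way:
import torch
import torch.nn.functional as F

# stack the [1, 768] tensors stored in the column into one (N, 768) tensor
stacked = torch.cat(matched_df['enc_rep'].tolist(), dim=0)

# row-wise cosine similarity and euclidean distance against the single query
cosine = F.cosine_similarity(stacked, input_sentence_embed.expand_as(stacked), dim=1)  # shape (N,)
euclidean = torch.cdist(input_sentence_embed, stacked).squeeze(0)                      # shape (N,)
matched_df['cosine_similarity'] = cosine.tolist()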
Upvotes: 0
Views: 1123
Reputation: 120509
Input data:
import pandas as pd
import numpy as np
from torch import tensor

match_df = pd.DataFrame({'INCIDENT_NUMBER': ['INC000030884498',
                                             'INC000029956111',
                                             'INC000029555353',
                                             'INC000029555338'],
                         'enc_rep': [[[tensor(0.2971), tensor(0.4831), tensor(0.8239), tensor(0.2048)]],
                                     [[tensor(0.3481), tensor(0.8104), tensor(0.2879), tensor(0.9747)]],
                                     [[tensor(0.2210), tensor(0.3478), tensor(0.2619), tensor(0.2429)]],
                                     [[tensor(0.2951), tensor(0.6698), tensor(0.9654), tensor(0.5733)]]]})

input_sentence_embed = [[tensor(0.0590), tensor(0.3919), tensor(0.7821), tensor(0.1967)]]

# broadcast the single embedding into every row
match_df["input_sentence_embed"] = [input_sentence_embed] * len(match_df)

# stack the stored embeddings into an (n_rows, dim) matrix and flatten the query
a = np.vstack(match_df["enc_rep"])
b = np.hstack(input_sentence_embed)

# row-wise cosine similarity: normalise each row of `a` (axis=1), not the whole matrix
match_df["cosine_similarity"] = a.dot(b) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b))
Output result:
INCIDENT_NUMBER enc_rep input_sentence_embed cosine_similarity
0 INC000030884498 [[tensor(0.2971), tensor(0.4831), tensor(0.823... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.971748
1 INC000029956111 [[tensor(0.3481), tensor(0.8104), tensor(0.287... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.624403
2 INC000029555353 [[tensor(0.2210), tensor(0.3478), tensor(0.261... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.820259
3 INC000029555338 [[tensor(0.2951), tensor(0.6698), tensor(0.965... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.952969
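As a quick sanity check (not part of the original answer), the same row-wise values can be reproduced with scikit-learn, assuming it is available:
from sklearn.metrics.pairwise import cosine_similarity

# compares every row of `a` against the single query vector `b`
match_df["cosine_similarity_check"] = cosine_similarity(a, b.reshape(1, -1)).ravel()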
Upvotes: 1
Reputation: 22356
I suppose you are trying to calculate the similarity or closeness of two vectors.
For cosine similarity, you need cos(A, B) = (A · B) / (|A| * |B|), see:
https://en.wikipedia.org/wiki/Cosine_similarity
For instance, if A = [0.8, 0.9] and B = [1.0, 0.0], then the cosine similarity of A and B is:
import numpy as np
import pandas as pd

A = np.array([0.8, 0.9])
B = np.array([1.0, 0.0])

# normalise both vectors, then take their dot product
EA = np.linalg.norm(A)
EB = np.linalg.norm(B)
NA = A / EA
NB = B / EB
COS_A_B = np.dot(NA, NB)
COS_A_B
---
0.6643638388299198
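The question also asks about euclidean distance; for the same two vectors that is simply the norm of their difference (a small aside, reusing A and B from above):
DIST_A_B = np.linalg.norm(A - B)  # euclidean distance, sqrt(0.85) ≈ 0.922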
So if we can get two vectors (rows) A and B from the enc_rep column, then we can calculate the cosine between them.
We need to figure out how to run those cosine calculations over the same column.
C = np.array([0.5, 0.3])
df = pd.DataFrame(columns=['ID','enc_rep'])
df.loc[0] = [1, A]
df.loc[1] = [2, B]
df.loc[2] = [3, C]
df
---
ID enc_rep
0 1 [0.8, 0.9]
1 2 [1.0, 0.0]
2 3 [0.5, 0.3]
One naive way is to create a cartesian product of the enc_rep column with itself.
cartesian_df = df['enc_rep'].to_frame().merge(df['enc_rep'], how='cross')
cartesian_df
---
enc_rep_x enc_rep_y
0 [0.8, 0.9] [0.8, 0.9]
1 [0.8, 0.9] [1.0, 0.0]
2 [0.8, 0.9] [0.5, 0.3]
3 [1.0, 0.0] [0.8, 0.9]
4 [1.0, 0.0] [1.0, 0.0]
5 [1.0, 0.0] [0.5, 0.3]
6 [0.5, 0.3] [0.8, 0.9]
7 [0.5, 0.3] [1.0, 0.0]
8 [0.5, 0.3] [0.5, 0.3]
Take the cosine between enc_rep_x and enc_rep_y.
def f(x, y):
    # cosine similarity: dot product of the two unit-normalised vectors
    nx = x / np.linalg.norm(x)
    ny = y / np.linalg.norm(y)
    return np.dot(nx, ny)
cartesian_df['cosine'] = cartesian_df.apply(lambda row: f(row.enc_rep_x, row.enc_rep_y), axis=1)
cartesian_df
---
enc_rep_x enc_rep_y cosine
0 [0.8, 0.9] [0.8, 0.9] 1.000000
1 [0.8, 0.9] [1.0, 0.0] 0.664364
2 [0.8, 0.9] [0.5, 0.3] 0.954226
3 [1.0, 0.0] [0.8, 0.9] 0.664364
4 [1.0, 0.0] [1.0, 0.0] 1.000000
5 [1.0, 0.0] [0.5, 0.3] 0.857493
6 [0.5, 0.3] [0.8, 0.9] 0.954226
7 [0.5, 0.3] [1.0, 0.0] 0.857493
8 [0.5, 0.3] [0.5, 0.3] 1.000000
However, if the number of rows is large, this creates a huge dataframe with duplicates. If the size is not an issue, you can drop one column and keep only the unique rows.
Hope this gives you an idea of how to approach it. As for the details (the vectors being 2-dimensional vs. 1-dimensional, etc.), please work them out on your own.
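For the original question's setup (one fixed query vector compared against every row), the cartesian product is not needed; the helper f above can be applied directly, sketched here with a hypothetical 1-D query vector standing in for the flattened input_sentence_embed:
query = np.array([0.1, 0.9])  # hypothetical query vector

# cosine between the fixed query and each stored vector, reusing f from above
df['cosine_to_query'] = df['enc_rep'].apply(lambda v: f(np.asarray(v), query))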
Upvotes: 1