Alex Kujur

Reputation: 121

Cosine similarity between columns of two different DataFrames

I want to compute the cosine similarity between columns of two DataFrames of different sizes and store the results in a new DataFrame. The similarity is calculated using BERT embeddings.

df_1
title
Lorem ipsum dolor sit amet
Lorem ipsum dolor sit amet
Lorem ipsum dolor sit amet

df_2
claim
fact checked claims one
fact checked claims tweet

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('bert-base-nli-mean-tokens')
df_1['title_embeddings'] = df_1['title'].apply(lambda x: model.encode(x))
df_2['claim_embeddings'] = df_2['claim'].apply(lambda x: model.encode(x))

sim_score = []
text = []
for i in range(len(df_2['claim_embeddings'])):
    # Each embedding is a 1D vector, so this call raises the ValueError shown below
    t = df_1['title_embeddings'].apply(lambda x: cosine_similarity(x, df_2['claim_embeddings'][i]))
    sim_score.append(t)
    text.append(df_2['claim_embeddings'][i])

Current error

ValueError: Expected 2D array, got 1D array instead:

Expected output

df
title                       claims                      sim score
Lorem ipsum dolor sit amet  fact checked claims one     0
Lorem ipsum dolor sit amet  fact checked claims one     0
Lorem ipsum dolor sit amet  fact checked claims one     0
Lorem ipsum dolor sit amet  fact checked claims tweet   0
Lorem ipsum dolor sit amet  fact checked claims tweet   0
Lorem ipsum dolor sit amet  fact checked claims tweet   0

I have tried the approach from "Calculate cosine similarity for vectors between two pandas columns?", but it didn't solve the issue.

Upvotes: 0

Views: 2151

Answers (1)

Vidya Ganesh

Reputation: 818

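cosine_similarity expects 2D arrays of shape (n_samples, n_features). Because you are comparing one sample at a time, each 1D embedding must first be reshaped with array.reshape(1, -1). For example: cosine_similarity(x.reshape(1, -1), y.reshape(1, -1))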

from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
# Encode each column in one batch; each result is a 2D array with one row per text
title = model.encode(df_1['title'].tolist())
claim = model.encode(df_2['claim'].tolist())
# A single row is 1D, so reshape it to (1, n_features) before comparing
cosine_similarity(title[0].reshape(1, -1), claim[0].reshape(1, -1))
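
To get the table from the question rather than a single score, one option is to compare all titles against all claims at once and then flatten the result into one row per (title, claim) pair. The following is only a sketch under some assumptions: the helper names pairwise_cosine and similarity_table are made up for illustration, the tiny 2-dimensional embeddings stand in for the real output of model.encode, and the cosine is computed with plain NumPy so the example runs without downloading a model. Calling sklearn's cosine_similarity(title_emb, claim_emb) on the two 2D arrays should give the same matrix.

```python
import numpy as np
import pandas as pd


def pairwise_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T  # shape: (len(a), len(b))


def similarity_table(titles, claims, title_emb, claim_emb) -> pd.DataFrame:
    """Long-format table with one row per (title, claim) pair."""
    sims = pairwise_cosine(np.asarray(title_emb), np.asarray(claim_emb))
    n_titles, n_claims = sims.shape
    return pd.DataFrame({
        # Claim-major order, matching the expected output in the question
        "title": np.tile(titles, n_claims),
        "claims": np.repeat(claims, n_titles),
        "sim score": sims.T.ravel(),
    })


# Stand-in data (hypothetical). With the real model you would pass
# model.encode(df_1['title'].tolist()) and model.encode(df_2['claim'].tolist()).
titles = ["title a", "title b", "title c"]
claims = ["fact checked claims one", "fact checked claims tweet"]
title_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
claim_emb = np.array([[1.0, 0.0], [0.0, 2.0]])

df = similarity_table(titles, claims, title_emb, claim_emb)
print(df)
```

Because both inputs stay 2D throughout, no reshape is needed here, and the two DataFrames can have different numbers of rows.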

Upvotes: 0
