Reputation: 121
I want to compute the cosine similarity between the rows of two DataFrames of different sizes and store the result in a new DataFrame. The similarity is calculated using BERT embeddings.
df1
title
Lorem ipsum dolor sit amet
Lorem ipsum dolor sit amet
Lorem ipsum dolor sit amet
df2
claim
fact checked claims one
fact checked claims tweet
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer('bert-base-nli-mean-tokens')
df_1['title_embeddings'] = df_1['title'].apply(lambda x: model.encode(x))
df_2['claim_embeddings'] = df_2['claim'].apply(lambda x: model.encode(x))
sim_score = []
text = []
for i in range(len(df_2['claim_embeddings'])):
    # This call raises the ValueError shown below
    t = df_1['title_embeddings'].apply(lambda x: cosine_similarity(x, df_2['claim_embeddings'][i]))
    sim_score.append(t)
    text.append(df_2['claim_embeddings'][i])
Current error
ValueError: Expected 2D array, got 1D array instead:
Expected output
df
title                       claims                     sim score
Lorem ipsum dolor sit amet  fact checked claims one    0
Lorem ipsum dolor sit amet  fact checked claims one    0
Lorem ipsum dolor sit amet  fact checked claims one    0
Lorem ipsum dolor sit amet  fact checked claims tweet  0
Lorem ipsum dolor sit amet  fact checked claims tweet  0
Lorem ipsum dolor sit amet  fact checked claims tweet  0
I have tried "Calculate cosine similarity for vectors between two pandas columns?", but it didn't solve the issue.
Upvotes: 0
Views: 2151
Reputation: 818
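cosine_similarity expects 2D arrays of shape (n_samples, n_features), but each embedding is a 1D vector. When you compare a single pair of samples, reshape each one with array.reshape(1, -1) first, for example cosine_similarity(x.reshape(1, -1), y.reshape(1, -1)).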
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')
# Encode each column in one batch; each result is a 2D array with one row per sentence
title = model.encode(df_1['title'].tolist())
claim = model.encode(df_2['claim'].tolist())
# Compare a single pair: reshape each 1D embedding to shape (1, n_features)
cosine_similarity(title[0].reshape(1, -1), claim[0].reshape(1, -1))
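The snippet above scores a single title/claim pair. To get the table from the question (every title against every claim), one option is to pass both embedding matrices to cosine_similarity at once and flatten the result. The following is only a sketch: the helper name pairwise_similarity_frame and the toy 2-dimensional embeddings are made up for illustration, and in practice the arrays would come from model.encode on each column.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def pairwise_similarity_frame(titles, claims, title_emb, claim_emb):
    """Return one row per (claim, title) pair with its cosine similarity.

    Hypothetical helper: title_emb and claim_emb are 2D arrays with one
    row per sentence.
    """
    # Two 2D inputs give a matrix of shape (n_titles, n_claims),
    # so no per-row reshape is needed.
    sim = cosine_similarity(np.asarray(title_emb), np.asarray(claim_emb))
    n_titles, n_claims = sim.shape
    return pd.DataFrame({
        # Claim-major order, as in the expected output of the question:
        # all titles for the first claim, then all titles for the next.
        "title": np.tile(np.asarray(titles, dtype=object), n_claims),
        "claims": np.repeat(np.asarray(claims, dtype=object), n_titles),
        "sim score": sim.T.ravel(),
    })


# Toy data standing in for real BERT embeddings (assumption for the demo).
titles = ["title a", "title b", "title c"]
claims = ["claim one", "claim two"]
title_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
claim_emb = np.array([[1.0, 0.0], [0.0, 2.0]])

result = pairwise_similarity_frame(titles, claims, title_emb, claim_emb)
print(result)
```

With the real data, the call would presumably look like pairwise_similarity_frame(df_1['title'], df_2['claim'], title, claim), reusing the title and claim arrays encoded above. Because the two inputs only need to share the embedding dimension, the DataFrames can have different numbers of rows.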
Upvotes: 0