Reputation: 43
I have a data frame which has two columns. The content column has about 8000 rows of sentences. The embeddings column has the embedding for each sentence from the content column.
I want to get the cosine similarity score for each pair of sentences.
I used: cosine_similarity(df['embeddings'][0], df['embeddings'][1:]). However, it only gives me the cosine similarities between sentence 0 and the rest of the sentences.
What I want is a dataframe like:
Any hints will be super helpful. Thank you!
Upvotes: 1
Views: 1383
Reputation: 524
What you need is the cosine similarity of every combination of 2 sentences in the data frame.
This can be done using the itertools.combinations function.
Ex:
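import numpy as np
import pandas as pd
from itertools import combinations
# Assumption: cosine_similarity is scikit-learn's, which expects 2D arrays
from sklearn.metrics.pairwise import cosine_similarity

# Iterate over pairs of row positions (i < j), not over column names
rows = []
for i, j in combinations(range(len(df)), 2):
    e0 = np.asarray(df['embeddings'].iloc[i]).reshape(1, -1)
    e1 = np.asarray(df['embeddings'].iloc[j]).reshape(1, -1)
    # The result is a 1x1 matrix, so take its single entry
    rows.append([df['content'].iloc[i], df['content'].iloc[j], cosine_similarity(e0, e1)[0, 0]])
# Build the frame once at the end instead of growing it row by row
sentenceCombs = pd.DataFrame(rows, columns=['Sentence0', 'Sentence1', 'CosineSim'])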
This code is untested, but with some modification (and a delimiter that definitely doesn't appear in your dataset), it should work well.
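With about 8000 sentences there are 8000 * 7999 / 2 = 31,996,000 pairs, so a Python-level loop over pairs may be slow. A possible alternative is to compute the whole similarity matrix at once and keep only the entries above the diagonal. The sketch below is not from the original answer: it computes cosine similarity directly with NumPy (normalize each row, then take dot products) rather than calling cosine_similarity. It assumes the columns are named content and embeddings, that every embedding has the same length, and that none is all zeros. The helper name pairwise_similarity_frame and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd


def pairwise_similarity_frame(df, text_col="content", emb_col="embeddings"):
    """Return one row per unordered sentence pair with its cosine similarity."""
    # Stack the per-row embeddings into one (n_sentences, dim) matrix.
    X = np.asarray(df[emb_col].tolist(), dtype=float)
    # Scale each row to unit length (assumes no all-zero embedding).
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Dot products of unit vectors are cosine similarities: an n x n matrix.
    sim = X @ X.T
    # Positions strictly above the diagonal: every pair (i, j) with i < j.
    i, j = np.triu_indices(len(df), k=1)
    texts = df[text_col].to_numpy()
    return pd.DataFrame({
        "Sentence0": texts[i],
        "Sentence1": texts[j],
        "CosineSim": sim[i, j],
    })


# Toy data with made-up 2-dimensional "embeddings".
toy = pd.DataFrame({
    "content": ["a", "b", "c"],
    "embeddings": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
})
pairs = pairwise_similarity_frame(toy)
```

For 8000 sentences the intermediate matrix holds 64 million float64 values (roughly 512 MB) and the result has about 32 million rows, so it may be worth filtering by a similarity threshold before building the frame.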
Upvotes: 1