LikeCoding

Reputation: 43

Cosine Similarity for Sentences in Dataframe

I have a data frame which has two columns. The content column has about 8000 rows of sentences. The embeddings column has the embedding for each sentence from the content column.


I want to get the cosine similarity score for each pair of sentences.

I used cosine_similarity(df['embeddings'][0], df['embeddings'][1:]). However, it only gives me the cosine similarities between sentence 0 and the remaining sentences.

What I want is a dataframe like:

[image]

Any hints will be super helpful. Thank you!

Upvotes: 1

Views: 1383

Answers (1)

luke

Reputation: 524

What you need is the cosine similarity of every combination of 2 sentences in the data frame.

This can be done using the itertools.combinations function, which yields every unordered pair exactly once.

Ex:

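import pandas as pd
from itertools import combinations
# Assumption: the cosine_similarity used in the question is scikit-learn's.
from sklearn.metrics.pairwise import cosine_similarity

rows = []
# Iterate over every unordered pair of row labels, not over the column names.
for i, j in combinations(df.index, 2):
    # cosine_similarity expects 2D inputs, so wrap each embedding in a list.
    sim = cosine_similarity([df.at[i, 'embeddings']], [df.at[j, 'embeddings']])[0, 0]
    rows.append([df.at[i, 'content'], df.at[j, 'content'], sim])

# Build the result once at the end instead of growing it row by row.
sentenceCombs = pd.DataFrame(rows, columns=['Sentence0', 'Sentence1', 'CosineSim'])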

This code is untested, but with some modification (and a delimiter that definitely doesn't appear in your dataset), it should work well.
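With about 8000 sentences there are roughly 32 million pairs, so a Python-level loop over combinations will likely be slow. A vectorized variant might look like the sketch below. It assumes the columns are named content and embeddings as in the question, that every embedding has the same length and a non-zero norm, and it computes the cosine with plain NumPy rather than calling cosine_similarity. The function name pairwise_cosine_frame and the demo data are made up for illustration.

```python
import numpy as np
import pandas as pd


def pairwise_cosine_frame(df, text_col="content", emb_col="embeddings"):
    """Return one row per unordered sentence pair with its cosine similarity."""
    # Stack the per-row embeddings into an (n_sentences, dim) matrix.
    emb = np.vstack(df[emb_col].tolist()).astype(float)
    # Normalise each row to unit length so a dot product equals the cosine.
    # Assumption: no embedding is all zeros (that would divide by zero).
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T  # full n x n similarity matrix
    # Strict upper triangle: each pair (i, j) with i < j appears exactly once.
    i, j = np.triu_indices(len(df), k=1)
    texts = df[text_col].to_numpy()
    return pd.DataFrame(
        {"Sentence0": texts[i], "Sentence1": texts[j], "CosineSim": sim[i, j]}
    )


# Tiny hypothetical frame standing in for the real 8000-row one.
demo = pd.DataFrame(
    {
        "content": ["a", "b", "c"],
        "embeddings": [
            np.array([1.0, 0.0]),
            np.array([0.0, 1.0]),
            np.array([1.0, 1.0]),
        ],
    }
)
pairs = pairwise_cosine_frame(demo)
print(pairs)
```

Keep in mind that the full 8000 x 8000 float64 matrix takes about 512 MB, and the resulting frame has one row per pair, so on a small machine it may be worth filtering by a similarity threshold before building the dataframe.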

Upvotes: 1
