MightyTreeFrog

Reputation: 55

NLP: Finding which sentence is closest in meaning to a list of other sentences

I have two lists of sentences (list A and list B). I want to find which sentence in A is closest in meaning to the entirety of B.

This is not the same as the standard cosine similarity check you can do when comparing two doc objects (in spaCy, for example): even if I iterate through A and compare each element of A to all elements of B, that leaves me with a collection of cosine similarity scores, while I want a single number representing the closeness of each element of A to all of B.

So far I have tried the following: for every element in A, perform a cosine similarity check with every element in B, leaving me with a list of values equal in length to B. Then I calculate the average of this list, giving me a single value which would ideally represent how close that element of A is to all of B.
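For reference, a minimal sketch of that averaging approach, assuming spaCy with a model that ships word vectors (e.g. en_core_web_md); the sentences are placeholders:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # needs a model with word vectors

A = ["Stocks fell sharply today.", "The cat sat on the mat."]
B = ["Markets dropped at the open.", "Investors sold off shares."]

docs_a = [nlp(s) for s in A]
docs_b = [nlp(s) for s in B]

avg_scores = []
for doc_a in docs_a:
    # one cosine similarity per element of B, collapsed to a mean
    sims = [doc_a.similarity(doc_b) for doc_b in docs_b]
    avg_scores.append(sum(sims) / len(sims))

# the element of A whose average similarity to B is highest
print(A[avg_scores.index(max(avg_scores))])
```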

The issue with that approach is that the averaging causes too much information loss: by the time I've done this for all elements of A, there isn't much difference between these condensed averages, so it's hard to conclude which element of A is closest to all of B.

P.S. I can show code if asked, but I feel it's irrelevant because the issue is with the approach itself, not broken code.

Upvotes: 1

Views: 876

Answers (1)

pmbaumgartner

Reputation: 712

I have a few approaches for when I run into a similar problem; for me it's often comparing new documents to a cluster of documents and finding which cluster is "most similar."

First, a sidebar: you can totally do this in spaCy, but if you're dealing with sentences or shorter paragraphs, you might want to try embedding them with a model from SentenceTransformers. spaCy's document embeddings are just the average of the word embeddings, so embedding the full document with a model designed for that may give you better results.
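A minimal sketch of that embedding step (the model name all-MiniLM-L6-v2 is just a common default, not the only choice, and the sentences are placeholders):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

A = ["Stocks fell sharply today.", "The cat sat on the mat."]
B = ["Markets dropped at the open.", "Investors sold off shares."]

emb_a = model.encode(A)  # numpy array, shape (len(A), dim)
emb_b = model.encode(B)  # numpy array, shape (len(B), dim)
```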

Assuming you have lists of documents A and B, and embeddings for both, the first thing I would do, instead of averaging the cosine similarities, is average the embeddings of B, then find the cosine similarity between each item in A and this averaged B embedding.
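Continuing from the emb_a / emb_b arrays in the sketch above, the centroid version would look roughly like this (util.cos_sim comes from sentence_transformers):

```python
from sentence_transformers import util

centroid_b = emb_b.mean(axis=0)          # one vector standing in for all of B
sims = util.cos_sim(emb_a, centroid_b)   # tensor of shape (len(A), 1)

best = int(sims.argmax())
print(A[best], float(sims[best]))        # element of A closest to B's centroid
```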

Sometimes, as you experienced with your original approach, averaging results in a loss of information. So, going back to our lists of embeddings for A and B: another approach I take, especially if the documents in B are highly variable in content, is to find, for each document in A, the document in B with the max cosine similarity value (sketched below). The benefit here is that you might find "clusters" of similar documents and be able to evaluate those. That helps because the "meaning of the entirety of B" isn't well defined, especially if B contains lots of documents, and this is a nice way to decompose both A and B and better understand which groups of documents are similar between them.
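A sketch of that max-similarity variant, reusing the same emb_a / emb_b from above:

```python
from sentence_transformers import util

pairwise = util.cos_sim(emb_a, emb_b)    # tensor of shape (len(A), len(B))
best_idx = pairwise.argmax(dim=1)        # index of each A item's best match in B
best_val = pairwise.max(dim=1).values    # that match's similarity score

for i, sent in enumerate(A):
    print(f"{sent!r} -> {B[int(best_idx[i])]!r} ({float(best_val[i]):.3f})")
```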

Whatever you choose, I hope you post back with the results!

Upvotes: 1
