Vikas Rathod

Reputation: 11

How does similarity_search_with_score() calculate the scores when retrieving the most similar documents from embeddings?

I am trying to retrieve the most similar documents for a question, and I am getting the top_k = 5 docs. How does similarity_search_with_score() calculate the scores when it retrieves the most similar documents from the embeddings?

I want to know the mathematics behind the similarity_search_with_score() method. I also need to understand how it re-ranks documents based on the score.

Upvotes: 1

Views: 3914

Answers (2)

Simi

Reputation: 323

Semantic similarity search methods typically return the n most similar results, defined as the n samples that are closest to the input vector. Closeness can, for instance, be defined as the Euclidean distance or the cosine distance between two vectors.
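To make the two distance measures concrete, here is a minimal pure-Python sketch. The function names `euclidean_distance` and `cosine_distance` and the sample vectors are made up for illustration; they are not part of any library's API.

```python
import math

def euclidean_distance(a, b):
    # L2 distance: square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical 2-dimensional "embeddings" for a query and a document.
query = [1.0, 0.0]
doc = [0.0, 1.0]

print(euclidean_distance(query, doc))
print(cosine_distance(query, doc))
```

With either measure, a smaller value means the two vectors are closer. Cosine distance ignores vector length and only looks at direction, while Euclidean distance is sensitive to both.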

To scale such a similarity search, you will need some kind of indexing algorithm upfront. IVF and HNSW are two popular ones that you will find in most vector databases. This blog explains it very well.

Upvotes: 0

Odney

Reputation: 863

According to the LangChain documentation, the method similarity_search_with_score uses the Euclidean (L2) distance to calculate the score and returns the documents ordered by this distance, together with their corresponding scores (distances). Because the score is a distance, a lower score means a closer match.

There is no additional re-ranking of the kind you are suggesting. The method returns the same top k documents as the simpler method similarity_search; the only difference is that the score (the Euclidean distance between the query and each document) is returned as well.
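The behaviour described above can be sketched as a toy brute-force search. This is not LangChain's implementation: `toy_search_with_score`, `toy_search`, and the sample vectors are hypothetical names and data chosen only to show that both variants produce the same ordering.

```python
import math

def l2_distance(a, b):
    # Euclidean (L2) distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def toy_search_with_score(query_vec, doc_vecs, k=5):
    # Score every document by its distance to the query, then keep the
    # k smallest distances. Sorting by distance is the only "ranking" step.
    scored = [(doc_id, l2_distance(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

def toy_search(query_vec, doc_vecs, k=5):
    # Same top-k documents in the same order, with the scores dropped.
    return [doc_id for doc_id, _ in toy_search_with_score(query_vec, doc_vecs, k)]

# Hypothetical document embeddings keyed by document id.
doc_vecs = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [3.0, 4.0]}
query_vec = [0.0, 0.0]

results = toy_search_with_score(query_vec, doc_vecs, k=2)
for doc_id, score in results:
    print(doc_id, score)
```

Some libraries report the squared distance instead of the distance itself. That changes the numbers but not the order of the results.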

Upvotes: 1
