Nadjib Bendaoud

Reputation: 524

Similarity search using Langchain Chroma not returning relevant results

I am using LangChain's Chroma vector store to store and retrieve data.

The data in the vector DB is in French and was embedded with OpenAI embeddings. Each record looks like this:

Le code suivant: 84431390 décrit: Machines et appareils à imprimer offset (sauf alimentés en feuilles ou en bobines)

The ultimate goal is to build a chat assistant. For now, I have isolated an issue with similarity search in Chroma: it performs poorly when I search for a numerical code like the one in the record above. For instance, if I give the following input query:

 code suivant : 84823000

I would expect to get the record containing that code. Instead, I get the following results:

'Le code suivant : 84864000 décrit: Machines et appareils visés à la note 11 C du chapitre 84'

'Le code suivant : 84483900 décrit: Parties et accessoires des machines du n° 8445, n.d.a.'

'Le code suivant : 84313900 décrit: Parties de machines et appareils du n° 8428, n.d.a.'

Is it hard for similarity search to find the relevant code, or is there something else that I am missing?

Upvotes: 0

Views: 523

Answers (1)

sahil vaghasiya

Reputation: 1

It could be due to a few common issues. Here is a short guide to troubleshooting and checking the quality of your similarity search.

  1. Check data quality. If there is noise in the data, the embeddings might not capture the right context.

  2. Try different embedding models. Sometimes a particular model works better for specific types of content.

  3. Inspect the embedding distance. After retrieving search results, inspect the distance or similarity scores (if available). This will help you understand whether the embeddings are close enough to be considered similar.
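As a rough illustration of step 3, here is a minimal sketch of a helper that lists each hit next to its score. It assumes your store exposes `similarity_search_with_score(query, k=...)` returning `(document, score)` pairs, which LangChain's Chroma wrapper does; to my understanding the score there is a distance, so lower means closer. `Doc`, `FakeStore`, and `inspect_scores` are hypothetical names used only so the example runs offline without an API key, and the scores in the stub are made-up placeholders, not real embedding distances.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a LangChain Document; only page_content is used here."""
    page_content: str


class FakeStore:
    """Hypothetical offline stub that mimics a vector store's scored search."""

    def __init__(self, scored_docs):
        # scored_docs: list of (Doc, distance) pairs with placeholder distances.
        self._scored_docs = scored_docs

    def similarity_search_with_score(self, query, k=4):
        # Assumption: the score is a distance, so smaller values rank first.
        return sorted(self._scored_docs, key=lambda pair: pair[1])[:k]


def inspect_scores(store, query, k=3):
    """Return (score, text) pairs so near-identical distances are easy to spot."""
    results = store.similarity_search_with_score(query, k=k)
    return [(round(score, 4), doc.page_content) for doc, score in results]


# Placeholder distances, chosen only to demonstrate the helper.
store = FakeStore([
    (Doc("Le code suivant : 84864000 décrit: ..."), 0.21),
    (Doc("Le code suivant : 84483900 décrit: ..."), 0.22),
    (Doc("Le code suivant : 84313900 décrit: ..."), 0.23),
])

for score, text in inspect_scores(store, "code suivant : 84823000"):
    print(f"{score:.4f}  {text}")
```

With a real store, you would pass your Chroma instance instead of `FakeStore`. If the top hits all have almost the same score, the embeddings are probably not distinguishing one numeric code from another, which would be consistent with the results in the question.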

Upvotes: 0
