SteroidMan

Reputation: 33

How to get better performance out of embedding models for retrieval/semantic search?

I am trying to build a document retrieval interface that takes an example document and retrieves the closest existing documents from a Postgres db. It seems like a simple use case; however, most examples I see use longer-form text documents that get chunked, with embeddings computed on shorter sections of a larger text, and lookup is usually just searching for specific information.

In my use case, the documents are customer reviews, and the embedding used for similarity search comes from a fake example review. When I use OpenAI's text-embedding-ada-002, it basically just works. I am sure I could tune it for even better results, but if I supply a fake review with specific criticisms, it will find real reviews with the same or similar critiques, and the same for positive ones. However, when I try to use Hugging Face's Transformers.js with gte-small or all-MiniLM-L6-v2, the output is generally horrible. Whether the input review is positive or negative, it almost exclusively retrieves positive reviews, and for the vast majority, none of the specific things mentioned seem related at all to the review that was supplied.

When I look at the MTEB leaderboard, gte-small seems to be ranked higher than text-embedding-ada-002 in all of the tasks which seem relevant, so why is it performing significantly worse here? Is there something I absolutely have to do differently when using the Hugging Face models, or maybe there is a specific model on HF which is better for this sort of retrieval? I'll put the embedding code and SQL query below in case something is blatantly wrong with it, but I am at a loss here. The lookup is through pgvector, btw.

import { pipeline } from '@xenova/transformers'

const generateEmbedding = await pipeline('feature-extraction', 'Supabase/all-MiniLM-L6-v2')

let data = null
let error = null

try {
    // Generate a vector using Transformers.js
    const output = await generateEmbedding(text, {
        pooling: 'mean',
        normalize: true,
    })

    // Extract the embedding output
    const embedding = JSON.stringify(Array.from(output.data))

    data = { embedding: embedding }
} catch (e) {
    error = e
}

-- CREATE HNSW index on reviews
CREATE INDEX ON public.reviews USING hnsw (embedding vector_cosine_ops);

-- CREATE function to retrieve review by embedding similarity
CREATE OR REPLACE FUNCTION get_reviews_by_embedding(
    v vector(1536),
    n INTEGER
)
RETURNS SETOF reviews
LANGUAGE plpgsql AS $$
BEGIN
    SET LOCAL hnsw.ef_search = 150;
    RETURN QUERY SELECT * FROM public.reviews ORDER BY reviews.embedding <#> v LIMIT n;
END
$$;

Upvotes: 2

Views: 3166

Answers (1)

Daniel Perez Efremova

Reputation: 710

Everything looks OK in your code.

why is it performing significantly worse here?

The performance discrepancy you're observing between OpenAI's text-embedding-ada-002 and Hugging Face's gte-small or all-miniLM-L6-v2 could be attributed to several factors:

  1. Model Architecture and Training Data:

Each model has a different architecture and was trained on different datasets. The differences in architecture, training data, and pre-training objectives can result in varying performance on specific downstream tasks. It's possible that the architecture used by text-embedding-ada-002 is better suited for your specific use case.

  2. Domain-Specific Embeddings:

Customer reviews might contain domain-specific language and nuances. If the training data for gte-small or all-MiniLM-L6-v2 does not cover a domain or use case similar to your customer reviews, it might not perform as well. Some models generalize better across domains, while others excel in more specialized domains. It seems that ada-002 is better optimized for your use case.

  3. Embedding Generation and Similarity Calculation:

The process of generating embeddings from your input review and calculating similarity can differ across models. Parameters like tokenization, truncation, and aggregation (pooling) of embeddings influence the final result. To the best of my knowledge, it is not published how ada-002 computes a sentence embedding (summation, mean, weighted mean, etc. of token embeddings) or how it pads and truncates sentences and tokens, so it is difficult to debug differences in embedding methodology. With Transformers.js you control these knobs explicitly, as in the sketch below.
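For example, a quick way to sanity-check the pooling and normalization settings is to embed a fake review and a couple of contrasting reviews and compare their similarities directly, before anything touches the database. This is only a minimal sketch using the same Supabase/all-MiniLM-L6-v2 pipeline as in your snippet; the review strings are made up for illustration:

import { pipeline } from '@xenova/transformers'

const extractor = await pipeline('feature-extraction', 'Supabase/all-MiniLM-L6-v2')

// Mean-pool the token embeddings into one vector and L2-normalize it,
// so the dot product below is a true cosine similarity.
const embed = async (text) => {
    const output = await extractor(text, { pooling: 'mean', normalize: true })
    return Array.from(output.data)
}

const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0)

const query = await embed('The battery died after two weeks and support never replied.')
const negative = await embed('Terrible battery life, and customer service ignored my emails.')
const positive = await embed('Great product, fast shipping, would buy again.')

// With normalized vectors, a higher dot product means more similar.
console.log('query vs negative review:', dot(query, negative))
console.log('query vs positive review:', dot(query, positive))

If the negative review does not score clearly above the positive one here, the problem is in the embedding step itself rather than in pgvector or the SQL.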

Is there something I absolutely have to do differently when using the Hugging Face models, or maybe there is a specific model on HF which is better for this sort of retrieval?

Your use case is information retrieval, so I would check the MTEB leaderboard filtered for the Retrieval task (https://huggingface.co/spaces/mteb/leaderboard).
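Swapping in another small model from that leaderboard is mostly a matter of changing the model id passed to the pipeline, as long as a Transformers.js-compatible conversion exists on the Hub. The model id below is only an example I have not benchmarked for you, and the pooling settings are the ones from your snippet, not necessarily that model's recommended ones:

import { pipeline } from '@xenova/transformers'

// Placeholder id -- substitute a model you picked from the MTEB Retrieval tab.
// It needs a Transformers.js/ONNX build on the Hub (the Xenova/* and Supabase/*
// organizations host many such conversions).
const MODEL_ID = 'Xenova/bge-small-en-v1.5'

const extractor = await pipeline('feature-extraction', MODEL_ID)

// Pooling, normalization and query-prefix conventions differ per model
// (mean vs CLS pooling, instruction prefixes for e5/bge-style models),
// so follow the model card rather than copying these defaults.
const fakeReview = 'The battery drains quickly and customer service never answered.'
const output = await extractor(fakeReview, { pooling: 'mean', normalize: true })
const embedding = Array.from(output.data)
console.log(embedding.length, 'dimensions')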

Then, test the models on your use case against the following guidelines (a quick way to do this is sketched after the list):

  1. It best fits your use case, i.e. it retrieves similar product reviews according to your expertise.
  2. It scores high in MTEB, to ensure it is reliable, stable and state-of-the-art.
  3. Consider also the model size (GB), since it affects performance and makes deployment harder: you need to store and serve a lot of weights. If a hosted API exists, as with OpenAI or Llama, it may be a good option.
  4. Consider also the embedding size. The shorter the vector, the better the performance, but the larger the vector, the more information captured. Try to find your most suitable trade-off of performance vs. retrieval quality by testing different models.
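To apply points 1 and 4 concretely, you can run a small offline comparison before touching the database: embed a fake review of known sentiment with each candidate model and check that a few hand-labelled sample reviews come back in a sensible order. This is only a sketch; the model ids and sample reviews are placeholders:

import { pipeline } from '@xenova/transformers'

// Candidate models to compare -- placeholders, pick yours from the MTEB Retrieval leaderboard.
const CANDIDATES = ['Supabase/all-MiniLM-L6-v2', 'Supabase/gte-small']

// Tiny hand-labelled sample standing in for rows from the reviews table.
const sampleReviews = [
    { text: 'Battery barely lasts a day and support was useless.', sentiment: 'negative' },
    { text: 'Love it, works exactly as described.', sentiment: 'positive' },
    { text: 'Shipping took a month and the box arrived damaged.', sentiment: 'negative' },
]

const fakeQuery = 'The battery drains quickly and customer service never answered.'

// Vectors are normalized, so the dot product is the cosine similarity.
const cosine = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0)

for (const modelId of CANDIDATES) {
    const extractor = await pipeline('feature-extraction', modelId)
    const embed = async (text) => {
        const output = await extractor(text, { pooling: 'mean', normalize: true })
        return Array.from(output.data)
    }

    const queryVec = await embed(fakeQuery)
    const ranked = []
    for (const review of sampleReviews) {
        ranked.push({ ...review, score: cosine(queryVec, await embed(review.text)) })
    }
    ranked.sort((a, b) => b.score - a.score)

    console.log(`\n${modelId}`)
    for (const r of ranked) {
        console.log(`  ${r.score.toFixed(3)}  [${r.sentiment}] ${r.text}`)
    }
}

Also keep guideline 4 in mind when wiring the winner back into pgvector: gte-small and all-MiniLM-L6-v2 produce 384-dimensional vectors while ada-002 produces 1536-dimensional ones, so the vector(...) type in the table and in the function signature has to match whichever model you settle on, and all stored reviews need to be embedded with that same model.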

Also, you may want to read the papers to learn about the training strategy; if a model was trained on QA or review data, it may be more suitable for your use case [1].

[1]: https://arxiv.org/pdf/2210.07316.pdf

Upvotes: 2
