SteroidMan

Reputation: 33

How to get better performance out of embedding models for retrieval/semantic search?

I am trying to build a document retrieval interface that takes an example document and retrieves the closest existing documents from a Postgres db. It seems like a simple use case; however, most examples I see use longer-form text documents that get chunked, with embeddings computed on shorter sections of a larger text, and lookup is usually just searching for specific information.

In my use case, the documents are customer reviews, and the embedding used for similarity search comes from a fake example review. When I use OpenAI's text-embedding-ada-002, it basically just works. I am sure I could tune it for even better results, but if I supply a fake review with specific criticisms, it will find real reviews with the same or similar critiques, and the same for positive ones. However, when I try to use Hugging Face's Transformers.js with gte-small or all-MiniLM-L6-v2, the output is generally horrible. Whether the input review is positive or negative, it almost exclusively retrieves positive reviews, and for the vast majority, none of the specific things mentioned seem related at all to the review that was supplied.

When I look at the MTEB leaderboard, gte-small seems to be ranked higher than text-embedding-ada-002 in all of the tasks which seem relevant, so why is it performing significantly worse here? Is there something I absolutely have to do differently when using the Hugging Face models, or maybe there is a specific model on HF which is better for this sort of retrieval? I'll put the embedding code and SQL query below in case something is blatantly wrong with it, but I am at a loss here. The lookup is through pgvector, btw.

import { pipeline } from '@xenova/transformers'

const generateEmbedding = await pipeline('feature-extraction', 'Supabase/all-MiniLM-L6-v2')

let data = null
let error = null

try {
    // Generate a vector using Transformers.js
    const output = await generateEmbedding(text, {
        pooling: 'mean',
        normalize: true,
    })

    // Extract the embedding output
    const embedding = JSON.stringify(Array.from(output.data))

    data = { embedding: embedding }
} catch (e) {
    error = e
}

-- CREATE HNSW index on reviews
CREATE INDEX ON public.reviews USING hnsw (embedding vector_cosine_ops);

-- CREATE function to retrieve review by embedding similarity
CREATE OR REPLACE FUNCTION get_reviews_by_embedding(
    v vector(1536),
    n INTEGER
)
RETURNS SETOF reviews
LANGUAGE plpgsql AS $$
BEGIN
    SET LOCAL hnsw.ef_search = 150;
    RETURN QUERY SELECT * FROM public.reviews ORDER BY reviews.embedding <#> v LIMIT n;
END
$$;

Upvotes: 2

Views: 3166

Answers (1)

Daniel Perez Efremova

Reputation: 710

Everything looks OK in your code.

why is it performing significantly worse here?

The performance discrepancy you're observing between OpenAI's text-embedding-ada-002 and Hugging Face's gte-small or all-miniLM-L6-v2 could be attributed to several factors:

  1. Model Architecture and Training Data:

Each model has a different architecture and was trained on different datasets. The differences in architecture, training data, and pre-training objectives can result in varying performance on specific downstream tasks. It's possible that the architecture used by text-embedding-ada-002 is better suited for your specific use case.

  2. Domain-Specific Embeddings:

Customer reviews might contain domain-specific language and nuances. If the training data for gte-small or all-MiniLM-L6-v2 does not cover a domain or use case similar to your customer reviews, it might not perform as well. Some models generalize better across domains, while others excel in more specialized domains. It seems that ada-002 is better optimized for your use case.

  3. Embedding Generation and Similarity Calculation:

The process of generating embeddings from your input review and calculating similarity can differ across models. Parameters like tokenization, truncation, and aggregation (pooling) of embeddings influence the final result. To the best of my knowledge, it is not published how ada-002 computes a sentence embedding (summation, mean, weighted mean, etc. of token embeddings) or how it pads and truncates sentences and tokens, so it is difficult to debug differences in embedding methodology. With Transformers.js you control these knobs explicitly, as in the sketch below.
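For example, a quick way to sanity-check the pooling and normalization settings is to embed a fake review and a couple of contrasting reviews and compare their similarities directly, before anything touches the database. This is only a minimal sketch using the same Supabase/all-MiniLM-L6-v2 pipeline as in your snippet; the review strings are made up for illustration:

import { pipeline } from '@xenova/transformers'

const extractor = await pipeline('feature-extraction', 'Supabase/all-MiniLM-L6-v2')

// Mean-pool the token embeddings into one vector and L2-normalize it,
// so the dot product below is a true cosine similarity.
const embed = async (text) => {
    const output = await extractor(text, { pooling: 'mean', normalize: true })
    return Array.from(output.data)
}

const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0)

const query = await embed('The battery died after two weeks and support never replied.')
const negative = await embed('Terrible battery life, and customer service ignored my emails.')
const positive = await embed('Great product, fast shipping, would buy again.')

// With normalized vectors, a higher dot product means more similar.
console.log('query vs negative review:', dot(query, negative))
console.log('query vs positive review:', dot(query, positive))

If the negative review does not score clearly above the positive one here, the problem is in the embedding step itself rather than in pgvector or the SQL.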

Is there something I absolutely have to do differently when using the Hugging Face models, or maybe there is a specific model on HF which is better for this sort of retrieval?

Your use case is information retrieval, so I would check the MTEB leaderboard filtered for the Retrieval task (https://huggingface.co/spaces/mteb/leaderboard).
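Swapping in another small model from that leaderboard is mostly a matter of changing the model id passed to the pipeline, as long as a Transformers.js-compatible conversion exists on the Hub. The model id below is only an example I have not benchmarked for you, and the pooling settings are the ones from your snippet, not necessarily that model's recommended ones:

import { pipeline } from '@xenova/transformers'

// Placeholder id -- substitute a model you picked from the MTEB Retrieval tab.
// It needs a Transformers.js/ONNX build on the Hub (the Xenova/* and Supabase/*
// organizations host many such conversions).
const MODEL_ID = 'Xenova/bge-small-en-v1.5'

const extractor = await pipeline('feature-extraction', MODEL_ID)

// Pooling, normalization and query-prefix conventions differ per model
// (mean vs CLS pooling, instruction prefixes for e5/bge-style models),
// so follow the model card rather than copying these defaults.
const fakeReview = 'The battery drains quickly and customer service never answered.'
const output = await extractor(fakeReview, { pooling: 'mean', normalize: true })
const embedding = Array.from(output.data)
console.log(embedding.length, 'dimensions')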

Then, test the models on your use case against the following guidelines (a quick way to do this is sketched after the list):

  1. It best fits your use case, i.e. it retrieves similar product reviews according to your expertise.
  2. It scores high in MTEB, to ensure it is reliable, stable and state-of-the-art.
  3. Consider also the model size (GB), since it affects performance and makes deployment harder: you need to store and serve a lot of weights. If a hosted API exists, as with OpenAI or Llama, it may be a good option.
  4. Consider also the embedding size. The shorter the vector, the better the performance, but the larger the vector, the more information captured. Try to find your most suitable trade-off of performance vs. retrieval quality by testing different models.
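To apply points 1 and 4 concretely, you can run a small offline comparison before touching the database: embed a fake review of known sentiment with each candidate model and check that a few hand-labelled sample reviews come back in a sensible order. This is only a sketch; the model ids and sample reviews are placeholders:

import { pipeline } from '@xenova/transformers'

// Candidate models to compare -- placeholders, pick yours from the MTEB Retrieval leaderboard.
const CANDIDATES = ['Supabase/all-MiniLM-L6-v2', 'Supabase/gte-small']

// Tiny hand-labelled sample standing in for rows from the reviews table.
const sampleReviews = [
    { text: 'Battery barely lasts a day and support was useless.', sentiment: 'negative' },
    { text: 'Love it, works exactly as described.', sentiment: 'positive' },
    { text: 'Shipping took a month and the box arrived damaged.', sentiment: 'negative' },
]

const fakeQuery = 'The battery drains quickly and customer service never answered.'

// Vectors are normalized, so the dot product is the cosine similarity.
const cosine = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0)

for (const modelId of CANDIDATES) {
    const extractor = await pipeline('feature-extraction', modelId)
    const embed = async (text) => {
        const output = await extractor(text, { pooling: 'mean', normalize: true })
        return Array.from(output.data)
    }

    const queryVec = await embed(fakeQuery)
    const ranked = []
    for (const review of sampleReviews) {
        ranked.push({ ...review, score: cosine(queryVec, await embed(review.text)) })
    }
    ranked.sort((a, b) => b.score - a.score)

    console.log(`\n${modelId}`)
    for (const r of ranked) {
        console.log(`  ${r.score.toFixed(3)}  [${r.sentiment}] ${r.text}`)
    }
}

Also keep guideline 4 in mind when wiring the winner back into pgvector: gte-small and all-MiniLM-L6-v2 produce 384-dimensional vectors while ada-002 produces 1536-dimensional ones, so the vector(...) type in the table and in the function signature has to match whichever model you settle on, and all stored reviews need to be embedded with that same model.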

Also, you may want to read the papers to learn about the training strategy; if a model was trained on QA or review data, it may be more suitable for your use case [1].

[1]: https://arxiv.org/pdf/2210.07316.pdf

Upvotes: 2
