Reputation: 33
I am trying to build a document retrieval interface which can take in an example document and retrieve the closest existing documents from a Postgres DB. It seems like a simple use case, but most examples I see use longer-form text documents which get chunked, with embeddings computed on shorter sections of a larger text, and lookup is usually just searching for specific information.
In my use case, the documents are customer reviews, and the embedding used for the similarity search comes from a fake example review. When I use OpenAI's text-embedding-ada-002, it basically just works. I'm sure I could tune it for even better results, but if I supply a fake review with specific criticisms, it will find real reviews with the same or similar critiques, and likewise for positive reviews.
However, when I try Hugging Face's Transformers.js with gte-small or all-MiniLM-L6-v2, the output is generally horrible. Whether the input review is positive or negative, it almost exclusively retrieves positive reviews, and for the vast majority, none of the specific points mentioned seem related at all to the review that was supplied. On the MTEB leaderboard, gte-small seems to rank higher than text-embedding-ada-002 in all of the tasks which seem relevant, so why is it performing significantly worse here? Is there something I absolutely have to do differently when using the Hugging Face models, or maybe there is a specific model on HF which is better for this sort of retrieval?
I'll put the embedding code and SQL query below in case something is blatantly wrong with it, but I am at a loss here. The lookup is through pgvector, btw.
// Transformers.js feature-extraction pipeline
import { pipeline } from '@xenova/transformers'

const generateEmbedding = await pipeline('feature-extraction', 'Supabase/all-MiniLM-L6-v2')

let data = null
let error = null
try {
  // Generate a vector using Transformers.js
  const output = await generateEmbedding(text, {
    pooling: 'mean',
    normalize: true,
  })
  // Extract the embedding output as a plain array, serialized for pgvector
  const embedding = JSON.stringify(Array.from(output.data))
  data = { embedding: embedding }
} catch (e) {
  error = e
}
-- CREATE HNSW index on reviews
CREATE INDEX ON public.reviews USING hnsw (embedding vector_cosine_ops);

-- CREATE function to retrieve reviews by embedding similarity
CREATE OR REPLACE FUNCTION get_reviews_by_embedding(
  v vector(1536),
  n INTEGER
)
RETURNS SETOF reviews
LANGUAGE plpgsql AS $$
BEGIN
  SET LOCAL hnsw.ef_search = 150;
  RETURN QUERY SELECT * FROM public.reviews ORDER BY reviews.embedding <#> v LIMIT n;
END
$$;
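For context, the embedding ends up being passed into that function roughly like this (a sketch only, assuming the supabase-js client; the environment variables and the sample review text are placeholders, not part of the original setup):

import { createClient } from '@supabase/supabase-js'
import { pipeline } from '@xenova/transformers'

// Placeholders: supply your own project URL and key.
const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_ANON_KEY)
const generateEmbedding = await pipeline('feature-extraction', 'Supabase/all-MiniLM-L6-v2')

const fakeReview = 'The delivery was late and the packaging was damaged.'
const output = await generateEmbedding(fakeReview, { pooling: 'mean', normalize: true })

// pgvector accepts the vector as an array literal string, e.g. '[0.01,0.02,...]'.
// The vector(...) dimension declared in SQL must match the embedding size of the model used.
const { data: reviews, error } = await supabase.rpc('get_reviews_by_embedding', {
  v: JSON.stringify(Array.from(output.data)),
  n: 5,
})
if (error) console.error(error)
else console.log(reviews)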
Upvotes: 2
Views: 3166
Reputation: 710
Everything looks OK in your code.
The performance discrepancy you're observing between OpenAI's text-embedding-ada-002 and Hugging Face's gte-small or all-MiniLM-L6-v2 could be attributed to several factors:
Each model uses a different architecture and was trained on different data. Differences in architecture, training data, and pre-training objectives can result in varying performance on specific downstream tasks. It's possible that text-embedding-ada-002 is simply better suited to your specific use case.
Customer reviews might contain domain-specific language and nuances. If the training data for gte-small or all-MiniLM-L6-v2 does not cover a domain or use case similar to your customer reviews, they might not perform as well. Some models generalize better across domains, while others excel in more specialized domains. It seems that ada-002 is better optimized for your use case.
The process of generating embeddings from your input review and calculating similarity might also differ across models. Details like tokenization, truncation, and the aggregation of token embeddings can influence the final result. It is not published (to the best of my knowledge) how ada-002 computes a sentence embedding (summation, mean, weighted mean, etc. of token embeddings), or how it pads and truncates sentences and tokens, so it is difficult to debug differences in embedding methodology.
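On the Transformers.js side you can at least inspect what the model actually sees. A minimal sketch, assuming the @xenova/transformers package and the same Supabase/all-MiniLM-L6-v2 checkpoint from the question (the review text is just an illustrative placeholder):

// Sketch: inspect how Transformers.js tokenizes and pools a review.
import { AutoTokenizer, pipeline } from '@xenova/transformers'

const modelId = 'Supabase/all-MiniLM-L6-v2'
const tokenizer = await AutoTokenizer.from_pretrained(modelId)
const extractor = await pipeline('feature-extraction', modelId)

const review = 'Shipping took three weeks and support never replied.'

// How many tokens does the model actually see? Reviews longer than the
// model's max sequence length get truncated (check the model card for the limit).
const { input_ids } = await tokenizer(review)
console.log('token count:', input_ids.dims.at(-1))

// Mean pooling + L2 normalization, as in the question's code.
const output = await extractor(review, { pooling: 'mean', normalize: true })
console.log('embedding dims:', output.dims) // e.g. [1, 384] for MiniLM-L6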
Your use case is information retrieval, so I would check the MTEB leaderboard specifically for the Retrieval task (https://huggingface.co/spaces/mteb/leaderboard).
Then test the candidate models on your own data before involving the database; a quick similarity check such as the one sketched below is usually enough to see whether a model separates positive and negative reviews at all.
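For example (a sketch only; the sample reviews and the cosine helper are placeholders, and you would swap in whichever model id you want to evaluate):

// Sketch: compare a fake query review against a few labelled real reviews
// for any Hugging Face embedding model, without touching the database.
import { pipeline } from '@xenova/transformers'

// Dot product equals cosine similarity here because the vectors are L2-normalized.
const cosine = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0)

const embed = async (extractor, text) => {
  const out = await extractor(text, { pooling: 'mean', normalize: true })
  return Array.from(out.data)
}

const extractor = await pipeline('feature-extraction', 'Supabase/all-MiniLM-L6-v2') // swap in other model ids here
const query = await embed(extractor, 'Terrible quality, the item broke after two days.')

const samples = [
  { label: 'negative', text: 'Broke within a week, very disappointed.' },
  { label: 'positive', text: 'Excellent build quality, works perfectly.' },
]
for (const s of samples) {
  const sim = cosine(query, await embed(extractor, s.text))
  console.log(s.label, sim.toFixed(3))
}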
Also, you may want to read the papers to learn about each model's training strategy; if a model was trained on QA or review data, it may be more suitable for your use case [1].

[1]: https://arxiv.org/pdf/2210.07316.pdf
Upvotes: 2