Cristian Sepulveda

Reputation: 1730

Embeddings and semantic search in Spanish

I'm building an AI assistant that interacts with custom Q&A stored in a vector database.

Every example I find presents this as a very simple task: chunk the documents (Q&A pairs in this case), create embeddings, store them in a vector DB, and then query the DB when searching.
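As a rough illustration of that workflow, here is a minimal in-memory sketch. Note that `toy_embed` is a hypothetical stand-in (a plain bag-of-words count vector), not a real embedding model, and the list `store` stands in for a vector DB; the sample Q&A pairs are made up. It only shows the chunk, embed, store, and query steps.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep word characters only (works with accented letters).
    return re.findall(r"\w+", text.lower())

def toy_embed(text, vocab):
    # Hypothetical stand-in for a real embedding model: bag-of-words counts.
    counts = Counter(tokenize(text))
    return [float(counts[word]) for word in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Chunk: each Q&A pair is treated as one chunk (made-up sample data).
qa_pairs = [
    "¿Qué animales viven en el mar? Peces, ballenas y tortugas.",
    "¿Cómo se hace el pan? Con harina, agua y levadura.",
]

# 2. Embed and 3. store: a plain list plays the role of the vector DB.
vocab = sorted({word for text in qa_pairs for word in tokenize(text)})
store = [(text, toy_embed(text, vocab)) for text in qa_pairs]

def search(query, store, vocab, k=1):
    # 4. Query: embed the query and rank stored chunks by cosine similarity.
    query_vec = toy_embed(query, vocab)
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

best_score, best_text = search("animales del mar", store, vocab)[0]
print(best_text)
```

Because the stand-in only matches exact words, a query such as "océano" scores 0.0 against every chunk. A real embedding model is meant to close exactly that gap, which is the part that is not working well here.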

However, OpenAI embeddings are not giving me good results for Q&A in Spanish, specifically for semantic search. For example, if I have a Q&A pair that talks about "mar" ("sea" in English) and I then query for "ocean", the query embedding should be close to the "mar" embeddings, but that is not the case.

What is the workflow for creating good embeddings for Spanish? Do I have to preprocess the Q&A text before creating the embeddings? Is there a better model than OpenAI's for this? I have searched a lot, but all the tutorials are for English. I think an answer for Spanish could apply to other languages too.

Upvotes: 0

Views: 613

Answers (1)

Clem

Reputation: 63

I ran into the same issue. OpenAI embeddings are imperfect. For example, they're often good at logical similarity but not necessarily at semantic similarity: two antonyms may have a high cosine similarity because they belong to the same topic, when you'd expect them to be far apart because their meanings are opposite.

One way to solve this, although I haven't tried it personally, would be to follow OpenAI's cookbook on the topic. In a nutshell, you provide labeled training examples, and the output is a matrix you can multiply your embeddings by. Hopefully the newly computed embeddings will then perform better on your specific task with your specific data.
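To make the "multiply your embeddings by a matrix" step concrete, here is a small sketch of how such a matrix would be applied at query time. The 3-dimensional vectors and the matrix values are made up for illustration; in practice the embeddings come from the model and the matrix comes from the training step the cookbook describes, which is not shown here.

```python
import math

def apply_matrix(embedding, matrix):
    # Row vector (length d) times matrix (d x d_out) gives a vector of length d_out.
    return [
        sum(value * row[j] for value, row in zip(embedding, matrix))
        for j in range(len(matrix[0]))
    ]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up 3-dimensional "embeddings" for two related terms.
emb_mar = [1.0, 0.0, 1.0]
emb_oceano = [0.0, 1.0, 1.0]

# Made-up matrix standing in for a learned one: it shrinks the first two
# dimensions (where the vectors disagree) and keeps the third (where they agree).
learned_matrix = [
    [0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]

before = cosine(emb_mar, emb_oceano)
after = cosine(
    apply_matrix(emb_mar, learned_matrix),
    apply_matrix(emb_oceano, learned_matrix),
)
print(f"before: {before:.3f}, after: {after:.3f}")
```

The same matrix has to be applied to both the stored embeddings and the query embedding, otherwise the two would no longer live in the same space.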

If you do try this approach, please let me know how it went!

Upvotes: 0
