aalimsha

Reputation: 23

How to enable cross-language query search in Azure Cognitive Search with Ada Embeddings?

I am working on a project that uses Azure Cognitive Search (Azure AI Search) for document search. We are using Ada Embeddings to create semantic vectors for our documents. However, we have a requirement to support cross-language query search, specifically querying Japanese documents with English questions.

My questions are:

  1. Do we need to incorporate language translation skills into Azure Cognitive Search when using Ada Embeddings to create vectors?
  2. Can Ada Embeddings support cross-language query search, where the source documents are in Japanese and we want to ask English questions on them?
  3. How can I use the Text Translation skill along with the chunking and embedding skills? Specifically, should I translate the text chunks individually and then pass the translated chunks to the embedding skill, or is there a different approach I should consider? I am using other skills as well, so should I merge the outputs from OCR and the other skills, chunk the merged content, and then pass the chunks to the embedding skill? Please advise.

From my understanding, Ada Embeddings is a language model developed by OpenAI primarily used for generating high-quality text. While it can create semantic embeddings for different languages, it does not inherently support cross-language query search.

To enable cross-language query search in Azure Cognitive Search with Ada Embeddings, I believe we would need to incorporate language translation. I am considering using the Text Translation skill for this purpose.

I would greatly appreciate any insights, guidance, or best practices on how to implement cross-language query search in Azure Cognitive Search with Ada Embeddings. Additionally, if there are any alternative approaches or considerations that I should be aware of, please let me know.

Thank you in advance for your help!

Upvotes: 0

Views: 367

Answers (1)

Gia Mondragon - MSFT

Reputation: 466

Irrespective of the language, the same embedding model should place semantically similar text close together in the same vector space, so vector search on its own should work without translation.
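
As a rough illustration of that idea, here is a minimal Python sketch of ranking documents against a query by cosine similarity. It assumes you supply your own `embed` function (hypothetical here; in practice it would call the same Ada embedding deployment for both the English query and the Japanese documents). The names `cosine_similarity` and `rank_documents` are made up for this example and are not part of any Azure SDK.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query: str,
    documents: List[str],
    embed: Callable[[str], Vector],  # assumption: same model for query and documents
) -> List[Tuple[str, float]]:
    """Return (document, score) pairs sorted from most to least similar."""
    query_vec = embed(query)
    scored = [(doc, cosine_similarity(query_vec, embed(doc))) for doc in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The point is only that the query and the documents go through the same `embed` function; no translation step appears anywhere. How well this works for English questions over Japanese text depends on the model, so it is worth checking on a sample of your own data.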

However, your use case may include queries that vectors alone won't match well (for example, product references or very specific terms). In that case, hybrid search may be a better fit, since it combines keyword and vector matching. The keyword part does need the query and the indexed text to be in the same language, so you would have to either translate the query in the application before issuing it or use the Text Translation skill at ingestion time.
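
The query-side option could look roughly like the sketch below. It builds a hybrid search request body in which the keyword part uses the translated question and the vector part uses the embedding of the original question. The `translate` and `embed` callables are hypothetical placeholders for your own translator and embedding calls, and `contentVector` is an assumed name for the vector field in your index. The body follows the shape of the Azure AI Search REST search request (`search`, `vectorQueries`, `top`) as of the 2023-11-01 API version; please verify it against the version you target.

```python
from typing import Any, Callable, Dict, Sequence


def build_hybrid_query(
    question: str,
    translate: Callable[[str], str],          # hypothetical: English -> Japanese
    embed: Callable[[str], Sequence[float]],  # hypothetical: Ada embedding call
    vector_field: str = "contentVector",      # assumption: your index's vector field
    k: int = 5,
) -> Dict[str, Any]:
    """Build a request body that combines keyword and vector search."""
    # Keyword matching needs the query in the language of the indexed text.
    keyword_text = translate(question)
    return {
        "search": keyword_text,
        "vectorQueries": [
            {
                "kind": "vector",
                # The vector part can use the untranslated question.
                "vector": list(embed(question)),
                "fields": vector_field,
                "k": k,
            }
        ],
        "top": k,
    }
```

The resulting dictionary would be sent as the JSON body of a search request against the index. If you translate at ingestion time instead, the keyword part can use the original English question against the translated field, and `translate` becomes unnecessary.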

Upvotes: 1
