p0rter

Reputation: 989

Why is the consumption of OpenAI tokens in Azure hybrid search 100x higher compared to 'regular' prompts?

We are using OpenAI via Microsoft Azure. Doing 'regular' prompts via the API leads to the following output:

{
  choices: [
    {
      content_filter_results: [Object],
      finish_reason: 'stop',
      index: 0,
      logprobs: null,
      message: [Object]
    }
  ],
  created: 1721653544,
  id: 'chatcmpl-9nn1UksGuyliAGQsnqUrh3ADv5UhR',
  model: 'gpt-4o-2024-05-13',
  object: 'chat.completion',
  prompt_filter_results: [ { prompt_index: 0, content_filter_results: [Object] } ],
  system_fingerprint: 'fp_abc28019ad',
  usage: { completion_tokens: 77, prompt_tokens: 36, total_tokens: 113 }
}
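
For reference, this 'regular' call is just a plain chat completion request against the deployment, roughly like this (simplified sketch using Node 18+ fetch; the endpoint, deployment name, API version and env var names are placeholders, not our exact values):

const response = await fetch(
    `${process.env.OPENAI_ENDPOINT}/openai/deployments/${process.env.OPENAI_DEPLOYMENT_ID}` +
        `/chat/completions?api-version=2024-02-15-preview`,
    {
        method: "POST",
        headers: {
            "Content-Type": "application/json",
            "api-key": process.env.OPEN_AI_API_KEY,
        },
        body: JSON.stringify({
            messages: [
                { role: "system", content: "You are a basic assistant." },
                { role: "user", content: "What is math?" },
            ],
            temperature: 0.7,
            max_tokens: 200,
        }),
    }
);

const completion = await response.json();
// Only the two messages above are billed as prompt tokens (~36 in the output shown).
console.log(completion.usage);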

Doing prompts that query the search endpoint leads to:

{
  id: 'afdcc11f-66e6-412b-8c03-5ee25a20d249',
  model: 'gpt-4o',
  created: 1721654963,
  object: 'extensions.chat.completion',
  choices: [ { index: 0, finish_reason: 'stop', message: [Object] } ],
  usage: { prompt_tokens: 6193, completion_tokens: 32, total_tokens: 6225 },
  system_fingerprint: 'fp_abc28019ad'
}

Please notice the extraordinarily higher number of prompt tokens.

Here is our setup for the search API call:

const SEARCH_BODY_TEMPLATE = {
    data_sources: [
        {
            type: "azure_search",
            parameters: {
                filter: null,
                endpoint: process.env.SEARCH_SERVICE_ENDPOINT,
                index_name: process.env.SEARCH_INDEX_NAME,
                project_resource_id: process.env.SEARCH_PROJECT_RESOURCE_ID,
                semantic_configuration: "azureml-default",
                authentication: {
                    type: "system_assigned_managed_identity",
                    key: null
                },
                role_information: "Your name is POC. You are an intelligent assistant that has been developed to help all employees. Keep your answers short and clear.",
                in_scope: true,
                strictness: 1,
                top_n_documents: 3,
                key: process.env.SEARCH_KEY,
                embedding_endpoint: "https://xxx.openai.azure.com/openai/deployments/text-embedding-ada-002/embeddings?api-version=2023-05-15",
                embedding_key: process.env.OPEN_AI_API_KEY,
                query_type: "vectorSimpleHybrid"
            }
        }
    ],
    messages: [{
        role: "system",
        content: "You are a basic assitant. Answer only if you really know. Otherwise answer 'i don't know'."
    },
    {
        role: "user",
        content: "What is math?"
    }],
    deployment: process.env.SEARCH_DEPLOYMENT_ID,
    temperature: 0.7,
    top_p: 0.95,
    max_tokens: 200,
    stop: null,
    frequency_penalty: 0,
    presence_penalty: 0,
}
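
This template is then posted to the deployment's chat completions endpoint, roughly like this (simplified sketch; the exact route and api-version depend on which API version you target, so treat the values below as placeholders — older versions use /extensions/chat/completions, newer ones accept data_sources on the plain route):

const url = `${process.env.OPENAI_ENDPOINT}/openai/deployments/${process.env.SEARCH_DEPLOYMENT_ID}` +
    `/extensions/chat/completions?api-version=2023-08-01-preview`;

const response = await fetch(url, {
    method: "POST",
    headers: {
        "Content-Type": "application/json",
        "api-key": process.env.OPEN_AI_API_KEY,
    },
    body: JSON.stringify(SEARCH_BODY_TEMPLATE),
});

const completion = await response.json();
// usage.prompt_tokens here also includes the retrieved chunks, not just our two messages.
console.log(completion.usage);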

When attaching data to the index, we also did the following to try to increase performance or lower the number of tokens consumed (with no effect):

So we end up with some questions:

Upvotes: 0

Views: 300

Answers (1)

Nicolas R

Reputation: 14619

I think you did not get how RAG works.

It's normal to see this increase in the number of tokens processed. When you ask a simple question without any search attached to it, you just have:

  • your question sent directly to the GPT model, so only your question text is counted as prompt tokens
  • the generated answer counted as completion tokens

When you use Search (whether it is simple / semantic / hybrid, etc.), the process is the following:

  • your query is sent to your Search service to find the top N matching documents (in your case, N = 3, since you set "top_n_documents: 3" in your query), so you get 3 blocks of around X tokens each (X depends on the chunking strategy of your index)
  • your query and these chunks are concatenated and sent to your GPT model: as a consequence, you get a lot more prompt tokens (initial query + all the tokens from the 3 chunks retrieved in the previous step)

You can see it illustrated in the "Basic RAG architecture with Azure AI Search" diagram: Azure OpenAI gets "Prompt + Knowledge" as its input.
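
As a rough back-of-the-envelope estimate (hypothetical chunk sizes, not taken from your index), the prompt token count breaks down roughly like this:

// Rough illustration of where the prompt tokens go (hypothetical numbers;
// the actual values depend on your chunking strategy and documents).
const questionTokens = 36;         // roughly what your 'regular' prompt consumed
const roleInformationTokens = 30;  // system message / role_information text (estimate)
const chunkTokens = 2000;          // average tokens per retrieved chunk (assumption)
const topNDocuments = 3;           // top_n_documents in your query

const estimatedPromptTokens =
    questionTokens + roleInformationTokens + topNDocuments * chunkTokens;

console.log(estimatedPromptTokens); // ~6066, the same order of magnitude as the 6193 you observed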

Lowering this number of tokens can be done through several actions:

  • reduce the size of the chunks when ingesting into Azure AI Search
  • reduce the top N documents value to have fewer documents retrieved from Azure AI Search (see the sketch below)
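
For example, keeping your SEARCH_BODY_TEMPLATE as-is and only tightening the retrieval parameters already shrinks the retrieved context (a sketch with illustrative values, not a recommendation for your data):

// Sketch: fewer / better-filtered documents means fewer chunk tokens in the prompt.
const LEANER_SEARCH_PARAMETERS = {
    ...SEARCH_BODY_TEMPLATE.data_sources[0].parameters,
    top_n_documents: 1,   // was 3: roughly one third of the retrieved context
    strictness: 3,        // higher strictness drops weaker matches before they reach the prompt
};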

Obviously, it highly depends on your documents' content and format, because you still need the right information in those retrieved items if you want the LLM to answer correctly: that's a quality/cost balance to find.

Upvotes: 2
