Reputation: 989
We are using OpenAI via Microsoft Azure. Sending 'regular' prompts through the API leads to the following output:
{
  choices: [
    {
      content_filter_results: [Object],
      finish_reason: 'stop',
      index: 0,
      logprobs: null,
      message: [Object]
    }
  ],
  created: 1721653544,
  id: 'chatcmpl-9nn1UksGuyliAGQsnqUrh3ADv5UhR',
  model: 'gpt-4o-2024-05-13',
  object: 'chat.completion',
  prompt_filter_results: [ { prompt_index: 0, content_filter_results: [Object] } ],
  system_fingerprint: 'fp_abc28019ad',
  usage: { completion_tokens: 77, prompt_tokens: 36, total_tokens: 113 }
}
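For reference, the 'regular' call is just a plain chat completions request. A minimal sketch, assuming Node 18+ (global fetch, inside an async context), api-version 2024-02-01, and env var names of our choosing:

// Plain Azure OpenAI chat completions call, no data source attached
// (OPENAI_ENDPOINT and DEPLOYMENT_ID are placeholder names):
const url = `${process.env.OPENAI_ENDPOINT}/openai/deployments/${process.env.DEPLOYMENT_ID}` +
  `/chat/completions?api-version=2024-02-01`;
const response = await fetch(url, {
  method: "POST",
  headers: { "Content-Type": "application/json", "api-key": process.env.OPEN_AI_API_KEY },
  body: JSON.stringify({
    messages: [{ role: "user", content: "What is math?" }],
    max_tokens: 200,
  }),
});
console.log(await response.json()); // produces the first output shown above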
Doing prompts querying the search endpoint leads to:
{
  id: 'afdcc11f-66e6-412b-8c03-5ee25a20d249',
  model: 'gpt-4o',
  created: 1721654963,
  object: 'extensions.chat.completion',
  choices: [ { index: 0, finish_reason: 'stop', message: [Object] } ],
  usage: { prompt_tokens: 6193, completion_tokens: 32, total_tokens: 6225 },
  system_fingerprint: 'fp_abc28019ad'
}
Please notice the extraordinarily higher number of tokens.
Here is our setup for the search API call:
const SEARCH_BODY_TEMPLATE = {
  data_sources: [
    {
      type: "azure_search",
      parameters: {
        filter: null,
        endpoint: process.env.SEARCH_SERVICE_ENDPOINT,
        index_name: process.env.SEARCH_INDEX_NAME,
        project_resource_id: process.env.SEARCH_PROJECT_RESOURCE_ID,
        semantic_configuration: "azureml-default",
        authentication: {
          type: "system_assigned_managed_identity",
          key: null
        },
        role_information: "Your name is POC. You are an intelligent assistant that has been developed to help all employees. Keep your answers short and clear.",
        in_scope: true,
        strictness: 1,
        top_n_documents: 3,
        key: process.env.SEARCH_KEY,
        embedding_endpoint: "https://xxx.openai.azure.com/openai/deployments/text-embedding-ada-002/embeddings?api-version=2023-05-15",
        embedding_key: process.env.OPEN_AI_API_KEY,
        query_type: "vectorSimpleHybrid"
      }
    }
  ],
  messages: [
    {
      role: "system",
      content: "You are a basic assistant. Answer only if you really know. Otherwise answer 'I don't know'."
    },
    {
      role: "user",
      content: "What is math?"
    }
  ],
  deployment: process.env.SEARCH_DEPLOYMENT_ID,
  temperature: 0.7,
  top_p: 0.95,
  max_tokens: 200,
  stop: null,
  frequency_penalty: 0,
  presence_penalty: 0
};
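For completeness, this is roughly how the template gets sent. A minimal sketch, assuming Node 18+ (global fetch, inside an async context) and api-version 2024-02-01, which accepts the snake_case data_sources field; with that version the deployment name belongs in the URL rather than the body, and OPENAI_ENDPOINT is a placeholder name for the base resource URL:

// Post the template above to the chat completions endpoint; the deployment
// is pulled out of the body and moved into the URL path:
const { deployment, ...body } = SEARCH_BODY_TEMPLATE;
const url = `${process.env.OPENAI_ENDPOINT}/openai/deployments/${deployment}` +
  `/chat/completions?api-version=2024-02-01`;
const response = await fetch(url, {
  method: "POST",
  headers: { "Content-Type": "application/json", "api-key": process.env.OPEN_AI_API_KEY },
  body: JSON.stringify(body),
});
console.log(await response.json()); // produces the second output shown above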
When attaching data to the index, we also made several adjustments to try to increase performance or lower the number of tokens consumed (with no effect).
So we end up with some questions.
Upvotes: 0
Views: 300
Reputation: 14619
I think you did not quite get how RAG works.
It is normal to see this increase in the number of tokens processed: when you ask a simple question without any search attached, the input is just your system and user messages (36 prompt tokens in your first example).
When you use Search (whether it is simple / semantic / hybrid etc.), the process is the following:
- your question is sent to the search index (as keywords, a vector, or both in your "vectorSimpleHybrid" case)
- the most relevant chunks are retrieved (top_n_documents: 3 in your config)
- those chunks are injected into the prompt, together with your messages and the role_information instructions
- this whole enriched prompt is sent to the LLM, which generates the answer
You can see it illustrated in the RAG diagram (image omitted): Azure OpenAI gets "Prompt + Knowledge" as its input.
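A back-of-the-envelope check using the numbers from your two outputs (the per-chunk figure is only an estimate):

// Rough breakdown of the prompt_tokens difference, values taken from the question:
const baseTokens = 36;    // prompt_tokens of the plain call (messages only)
const ragTokens = 6193;   // prompt_tokens once search results are injected
const overhead = ragTokens - baseTokens;  // 6157 tokens of retrieved content + injected instructions
console.log(Math.round(overhead / 3));    // ≈ 2052 tokens per chunk with top_n_documents: 3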
Lowering this amount of tokens can be done by several actions (a sketch follows this list):
- lowering top_n_documents, so fewer chunks are injected into the prompt
- raising strictness, so low-relevance chunks are filtered out before injection
- chunking your documents into smaller pieces at indexing time, so each retrieved chunk costs fewer tokens
- shortening the role_information instructions
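For instance, a minimal sketch of such adjustments against your own template; the values here are assumptions to illustrate the levers, not tuned recommendations:

// Hypothetical overrides of the parameters shown in the question:
const TUNED_PARAMETERS = {
  ...SEARCH_BODY_TEMPLATE.data_sources[0].parameters,
  top_n_documents: 1,  // fewer retrieved chunks -> fewer injected tokens
  strictness: 3,       // drop low-relevance chunks instead of injecting them
  role_information: "You are POC, a concise assistant for all employees.", // shorter instructions
};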
Obviously, it highly depends on your documents' content and format, because you still need the right information in the retrieved items if you want the LLM to answer correctly: it's a quality/cost balance to strike.
Upvotes: 2