Reputation: 82
In a project I'm currently working on, I have an information extraction task to do with an LLM that requires a large set of instructions. These instructions contain an object schema (plus the schemas of the object's components) and examples of input/output for this object (and its components), created with the Kor library in Python. This set of instructions is large, about 5000 tokens without the input. To perform the task, I send these instructions to an LLM along with the text from which the information needs to be extracted.
I would like to deploy a model for my web app, where multiple users may call this function at the same time. I currently use a model deployed on Azure AI Foundry, but the problem is that every time the model is called I need to send it the complete set of instructions, consuming an additional 5k tokens per call.
Is there any way to deploy a model with this set of instructions already "in memory", without having to fine-tune a model? I would like to avoid sending these instructions every time I want to perform this task.
I could probably open multiple conversations with the model and reuse them, but the texts previously sent to a conversation would be added to its context, which is not what I want here: each call to this function is independent of any previous or future call. Keeping 5-10 conversations open and sending each request to an available conversation (one that may have been used before) is therefore a possible workaround, but it could cause problems with the context window and the initial instructions.
Upvotes: -1
Views: 32
Reputation: 2026
Great question. Unfortunately, the LLM session is stateless, which means that no matter how you "save" the prompt in memory on the Azure backend, the full set of instructions still has to be fed into the model's computation on every call; that's the tricky point.
Last year Microsoft announced Prompt Caching, a built-in feature available for Azure OpenAI models. If you can construct your prompt to meet two conditions (the prompt is at least 1024 tokens long, and the static instructions form an identical prefix at the start of every request), then you have a chance to get a discount:
For supported models, cached tokens are billed at a discount on input token pricing for Standard deployment types and up to 100% discount on input tokens for Provisioned deployment types.
Read more here https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/prompt-caching#getting-started
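A minimal sketch under those conditions (the endpoint, API version and deployment name below are placeholders, and KOR_INSTRUCTIONS stands for your ~5k-token Kor-generated block): keep the static instructions as a byte-identical system-message prefix and append only the variable text; on supported models, the response's usage.prompt_tokens_details.cached_tokens tells you how much of the prefix was served from cache.

```python
# Sketch (placeholder endpoint/deployment names): the ~5k-token Kor instructions
# go first as an identical system-message prefix and only the user text varies,
# so prompt caching can reuse the prefix on supported models (prompts >= 1024 tokens).
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-10-21",
)

KOR_INSTRUCTIONS = "..."  # must be byte-for-byte identical on every call

def extract(text: str) -> str:
    response = client.chat.completions.create(
        model="<deployment-name>",  # a model that supports prompt caching
        messages=[
            {"role": "system", "content": KOR_INSTRUCTIONS},  # static prefix
            {"role": "user", "content": text},                # variable suffix
        ],
    )
    # On supported models/API versions, cached_tokens shows how much of the
    # prefix was served from cache (and billed at the discounted rate).
    details = response.usage.prompt_tokens_details
    print("cached prompt tokens:", details.cached_tokens if details else 0)
    return response.choices[0].message.content
```

Note that the cache matches on the prompt prefix, so anything that changes between calls (per-user data, timestamps, the input text) must come after the static block, never before it.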
Another improvement, which you already thought of but which goes by a more professional term, is Semantic Caching: the integration between APIM and Redis Cache https://learn.microsoft.com/en-us/azure/api-management/azure-openai-enable-semantic-caching
You can absolutely build your own semantic caching strategy; this requires more effort and hand-rolled code, but it can be tailored to your needs. Depending on your specific use cases, you can also detect the intent of the conversation and route the action directly to tools, e.g. function calling against a SQL database, sending an email, making an HTTP request, and so on.
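For example, a hedged sketch of a DIY semantic cache (every name, the threshold, and the embedding deployment are illustrative assumptions): embed the incoming text, return a stored extraction result when a very similar text was already processed, and only fall back to the full 5k-token LLM call on a cache miss.

```python
# Hedged sketch of a DIY semantic cache (illustrative names and threshold):
# reuse a stored result when a very similar text was already processed.
import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-10-21",
)

_cache: list[tuple[np.ndarray, str]] = []  # (normalized embedding, extraction result)
SIMILARITY_THRESHOLD = 0.97                # tune on your own data

def _embed(text: str) -> np.ndarray:
    emb = client.embeddings.create(model="<embedding-deployment>", input=text)
    vec = np.array(emb.data[0].embedding)
    return vec / np.linalg.norm(vec)

def extract_with_cache(text: str, extract_fn) -> str:
    """extract_fn is your normal full-prompt extraction call."""
    query = _embed(text)
    for vec, result in _cache:
        # cosine similarity on normalized vectors is just a dot product
        if float(np.dot(query, vec)) >= SIMILARITY_THRESHOLD:
            return result                  # cache hit: no 5k-token LLM call
    result = extract_fn(text)              # cache miss: do the real extraction
    _cache.append((query, result))
    return result
```

In practice you would back this with Redis (as the APIM policy does) rather than an in-process list, and it only pays off if near-duplicate input texts are actually expected.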
The game changer, which you may already know about, is Agentic AI: multiple agents, each serving a specific task, so the "long prompt" work is divided among smaller agents. This is also a good strategy to reduce your prompt size: https://techcommunity.microsoft.com/blog/azure-ai-services-blog/building-a-multimodal-multi-agent-framework-with-azure-openai-assistant-api/4084007
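Not a full multi-agent framework, but a rough, simplified illustration of the "divide the long prompt" idea (all component names, instructions and the client variable are assumptions; client is an AzureOpenAI client configured as in the prompt-caching sketch above): a cheap routing call first decides which component of your object the text concerns, and only that component's much smaller Kor instructions are sent for the actual extraction.

```python
# Rough sketch of splitting the long prompt (illustrative names only).
# `client` is an AzureOpenAI client configured as in the prompt-caching sketch above.
COMPONENT_INSTRUCTIONS = {
    "address": "...",  # smaller Kor schema + examples for the address component
    "contact": "...",  # smaller Kor schema + examples for the contact component
    "order":   "...",  # smaller Kor schema + examples for the order component
}

ROUTER_PROMPT = (
    "Decide which component the following text describes. "
    "Answer with exactly one of: " + ", ".join(COMPONENT_INSTRUCTIONS) + "."
)

def route_and_extract(text: str) -> str:
    # Step 1: short, cheap routing call.
    routing = client.chat.completions.create(
        model="<deployment-name>",
        messages=[
            {"role": "system", "content": ROUTER_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    component = routing.choices[0].message.content.strip().lower()
    # Fall back to the full instruction set if the router's answer is unexpected.
    instructions = COMPONENT_INSTRUCTIONS.get(
        component, "\n\n".join(COMPONENT_INSTRUCTIONS.values())
    )

    # Step 2: extraction call carrying only the relevant, smaller instructions.
    response = client.chat.completions.create(
        model="<deployment-name>",
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content
```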
Upvotes: -1