Can Google Gemini Context Caching accept multi-modal input?

Question

the main doc where this is discussed isn't exactly clear.

"Cached content can be any of the MIME types supported by Gemini multimodal models. For example, you can cache a large amount of text, audio, or video. You can specify more than one file to cache."

My thinking was that I could use the Context Cache to cache an entire prompt with multi-modal input (e.g. a list of mixed images and text) in the same way a system prompt works. Like it is just prepended before any downstream prompt i use that references the cache. For e.g., I spend a million tokens teaching Gemini to do something in a multi-modal cached prompt and it can be used repeatedly (prepended before) a much smaller prompt

However, the statement above could also be read as you can only cache specific MIME types. For e.g. instead of an entire multi-modal prompt, I can only cache the images from that prompt. If thats true, and my intention is to use the cached files in a downstream multi-modal prompt, how would you reference each image uniquely in the downstream prompt?

I realize the feature is still Pre-GA, but I hope we get some more examples in these docs

Can Google Gemini Context Caching accept multi-modal input?

Answers (0)

Related Questions