Dion Neo

Reputation: 19

How can I increase the rate limits or batch requests for the Google Vertex AI Bison API?

I'm testing out the Google PaLM API to recursively summarize a long text and have run into rate-limiting issues, which raised a few questions I'd like to verify.

It seems that the rate limit for requests to the Bison API is 60/min (this seems quite low).

  1. Is there a way to batch the requests made to the Bison API? And will that allow me to make more inferences per second?
  2. Is there a way to increase the rate limits? 60/min seems too low and not fit for production use.

Thanks!

I tried looking into these documents:

  1. Rate limit documents: Table for rate limits

  2. Increasing rate limits, but it seems like it's not meant for the Bison model

Upvotes: 1

Views: 1391

Answers (1)

Okry Dokry

Reputation: 135

  1. You can make batch requests to text-bison: https://cloud.google.com/vertex-ai/docs/generative-ai/text/batch-prediction-genai The linked document describes how to prepare your batch inputs and invoke a batch request (see the sketch below this list).

You can also submit a batch request as part of a Vertex AI pipeline job: https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/generative_ai/batch_eval_llm.ipynb

  2. You can request a quota increase in your project (in the Google Cloud console, under IAM & Admin > Quotas). The name of the quota in this case is "Online prediction requests per base model per minute per region per base_model", and you would request a quota change for the region and model of interest, e.g. us-central1 and text-bison.
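For reference, here's a minimal sketch of what a batch request can look like with the Python SDK (google-cloud-aiplatform). The project ID, region, and gs:// paths are placeholders, and the input is assumed to be a JSONL file in Cloud Storage with one {"prompt": "..."} object per line, as described in the linked batch-prediction document:

    # Minimal sketch of a text-bison batch prediction job, assuming the
    # google-cloud-aiplatform SDK is installed and the caller is authenticated.
    # Project ID, region, and bucket paths below are placeholders.
    import vertexai
    from vertexai.language_models import TextGenerationModel

    vertexai.init(project="your-project-id", location="us-central1")

    model = TextGenerationModel.from_pretrained("text-bison")

    # Input: a JSONL file in Cloud Storage, one {"prompt": "..."} object per line.
    # Output: JSONL result files written under the destination prefix.
    batch_job = model.batch_predict(
        dataset="gs://your-bucket/batch_prompts.jsonl",
        destination_uri_prefix="gs://your-bucket/batch_output",
        model_parameters={"temperature": 0.2, "max_output_tokens": 256},
    )

    batch_job.wait()  # blocks until the job succeeds or fails
    print(batch_job.state)

Note that batch prediction runs asynchronously and is governed by its own quotas rather than the online 60 requests/min limit, so you trade latency for throughput.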

Upvotes: 0
