Ryan Stack

Reputation: 1331

Scaling Behavior on Google's Vertex AI Versus AI Platform

We are in the process of migrating from AI Platform to Vertex AI, specifically for our inference-time / online predictions.

Are there any known techniques / configurations for deploying a model to an endpoint on Vertex AI such that the scaling behavior is similar to the scaling behavior on AI Platform?

On AI Platform, our model is deployed on an mls1-c1-m2 machine, which I believe has just 1 core. When we run a performance test, we can immediately start sending 10 requests per second, and the service scales out well. There is some initial delay, but then the service appears to scale out rapidly. Unfortunately I can't provide much quantifiable proof of this: when I click the "Resource Usage" tab on the AI Platform model, I see "No data is available for the selected time frame".
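For reference, this is roughly how the AI Platform version is created (the model name, bucket path, and runtime/framework versions below are placeholders, not our exact values):

```
gcloud ai-platform versions create v1 \
  --model=my_model \
  --origin=gs://my-bucket/model-dir/ \
  --framework=tensorflow \
  --runtime-version=2.5 \
  --python-version=3.7 \
  --machine-type=mls1-c1-m2
```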

If we run the same performance test on Vertex AI (immediately sending 10 requests per second), there are a couple of adjustments I have to make to avoid a build-up of 500 errors: I have to use a 4-core machine (n1-standard-4), increase the wait time between retries to 2 minutes, and set maxReplicaCount to 5. Even then, however, the performance test ends with a couple dozen errors.
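Here is roughly the Vertex AI deployment call, using the google-cloud-aiplatform Python SDK (the project, region, and model ID are placeholders; the machine type and replica count are the values described above):

```python
from google.cloud import aiplatform

# Placeholder project and region, not our actual values
aiplatform.init(project="my-project", location="us-central1")

# Placeholder model resource name
model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/1234567890"
)

# Deploys to a new endpoint; machine_type and max_replica_count
# match the performance test configuration described above
endpoint = model.deploy(
    machine_type="n1-standard-4",  # 4-core machine
    min_replica_count=1,
    max_replica_count=5,
)
```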

From what I have observed so far, adding additional nodes in Vertex AI takes roughly 2-3 minutes (I've tried n1-standard-2, n1-standard-4, and e2-standard-4), whereas on AI Platform I conclude that adding capacity must take only a few seconds, given that 10 requests per second would most likely overwhelm a single mls1-c1-m2 machine.

That being said, are there any adjustments you'd recommend in order to achieve scaling behavior similar to AI Platform?

Upvotes: 3

Views: 319

Answers (0)
