Reputation: 1050
I have an Inception V3 model with some input and output modifications deployed to Google Cloud ML Engine for online prediction. Over a week or so I received relatively few, sparse requests (around 130), with a median latency of about 100 ms and a 95th-percentile latency of about 2000 ms. I have already accrued around 2 node-hours. The minimum number of nodes is set to 0. This is my first time using Cloud ML Engine in production.
The questions:
I know that nodes stay up for several minutes after a request. But how can I estimate the number of requests, say per minute, that will cause the system to scale up? There seems to be no information available on the CPU usage of the nodes.
In my case I expect the number of requests to grow steadily. Should I expect node-hours to approach roughly 30*24 (days times hours in a month, i.e. one node running continuously), saturate at that value for some time, and then grow further once the CPU utilization of the prediction nodes reaches, say, 70%?
Upvotes: 0
Views: 82
Reputation: 209
We do publish request-level logs to Stackdriver. You can turn them on by creating a model with online_prediction_logging = True. Those logs include a field called loading_request, which tells you whether a request landed on a freshly started machine. Over a short time window, counting those entries gives you a rough estimate of how many nodes were brought up. For a more accurate picture of node scale-up, the feature that rhaertel80 suggested should help.
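As a sketch of how you might count loading requests once you have exported those log entries, here is a minimal Python example. The flat dict layout with a boolean `loading_request` key is an assumption for illustration, not the documented Stackdriver payload schema:

```python
def count_loading_requests(entries):
    """Count log entries flagged as landing on a freshly started node.

    `entries` is an iterable of dicts; we assume each carries a boolean
    `loading_request` field (hypothetical layout, adjust to the real payload).
    """
    return sum(1 for e in entries if e.get("loading_request"))

# Example: three requests in a window, one of which hit a cold node
logs = [
    {"loading_request": True, "latency_ms": 2100},
    {"loading_request": False, "latency_ms": 95},
    {"loading_request": False, "latency_ms": 110},
]
print(count_loading_requests(logs))  # prints 1
```

Within a short window, that count approximates the number of new machines that were spun up, since each cold node serves one loading request.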
Upvotes: 0
Reputation: 8389
You will soon be able to monitor the number of nodes in use, but you can't do so yet. In the meantime, you can make a quick-and-dirty estimate from your mean QPS and mean latency. Assuming approximately 60% utilization per node:
X qps * 0.2 secs/query / 0.6 ≈ number of nodes
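The estimate above can be sketched as a small Python helper. The 0.2 s mean latency and the 60% utilization target are the assumptions from the answer; the function and parameter names are mine:

```python
def estimate_nodes(mean_qps, mean_latency_s, utilization=0.6):
    """Back-of-the-envelope node count: busy-seconds generated per second
    of traffic (qps * latency), divided by the target utilization per node."""
    return mean_qps * mean_latency_s / utilization

# Example: 30 queries/sec at a 200 ms mean latency, 60% utilization
print(estimate_nodes(30, 0.2))  # prints 10.0
```

In other words, each in-flight query occupies a node for its full latency, so qps × latency is the average number of busy nodes, and dividing by the utilization target adds headroom.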
Upvotes: 0