Jan Hacker

Reputation: 495

How to fix CloudRun error 'The request was aborted because there was no available instance'

I'm using managed Cloud Run to deploy a container with concurrency=1. Once deployed, I'm firing four long-running requests in parallel. Most of the time all works fine, but occasionally I'm facing 500s from one of the nodes within a few seconds; the logs only provide the error message quoted in the subject.

Using retry with exponential back-off did not improve the situation; the retries also end up with 500s. Stackdriver logs also do not provide further information.
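For reference, the retry logic is roughly the following sketch (the actual HTTP call is abstracted behind a callable here; names like `call_with_backoff` are just illustrative):

```python
import time


def call_with_backoff(do_request, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry a request on HTTP 5xx with exponential back-off.

    do_request: callable performing one request and returning its HTTP
    status code. Returns the last status code observed.
    """
    for attempt in range(max_attempts):
        status = do_request()
        if status < 500:
            return status
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return status
```

Even with this in place, all attempts come back as 500 when the error occurs.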

Potentially relevant gcloud beta run deploy arguments:

--memory 2Gi --concurrency 1 --timeout 8m --platform managed

What does the error message mean exactly -- and how can I solve the issue?

Upvotes: 36

Views: 25061

Answers (7)

Paku

Reputation: 777

This error can be business as usual for Cloud Run during scaling.

During scale-up, the GCP networking stack routes your request to the cold-starting instance even though it hasn't passed its health check yet, so the client is left hanging for the duration of the cold start plus the duration of the request.

This is suboptimal, since you might have existing Cloud Run instances with spare capacity that could serve the request immediately. Ideally, no requests would be routed to cold-starting instances while current instances aren't overloaded.

The load balancer keeps the client waiting until the cold start plus the request has finished. These error messages pop up at timeout, which is a combination of the load balancer timeout, the Cloud Run service timeout, the client timeout, and a GCP infrastructure timeout (about 10 s?). On timeout, the load balancer logs response_sent_by_backend with status 500, even though your "backend", i.e. your container, never received the request due to the networking layer.

For me the main problem is: why are Cloud Run instances scaling up in scenarios where, according to the docs, they shouldn't be?

Based on the autoscaling logic described here and here, you might have no reason for Cloud Run to scale up, yet it suddenly might.

e.g.

  • CPU usage is 10% (not 60%)
  • Concurrency is set to 80, but there are only 5 concurrent requests
  • You currently have 1 instance, min instances is set to 1, max instances is 5

For the log entries where a duration is involved, it's important to compare the receiveTimestamp of the log entry with its timestamp: timestamp is when the request arrived, and receiveTimestamp is when the response was sent.
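As a rough illustration of that comparison (field names as they appear in Cloud Logging request entries; the sample values below are invented):

```python
from datetime import datetime

# A trimmed-down request log entry (timestamp values are made up).
entry = {
    "timestamp": "2020-06-01T12:00:00Z",         # request arrived
    "receiveTimestamp": "2020-06-01T12:00:14Z",  # response sent
}


def request_duration_seconds(entry):
    """Seconds between the request arriving and the response being sent."""
    def parse(s):
        # fromisoformat() before Python 3.11 rejects a trailing "Z".
        return datetime.fromisoformat(s.replace("Z", "+00:00"))

    return (parse(entry["receiveTimestamp"])
            - parse(entry["timestamp"])).total_seconds()
```

A large gap between the two fields hints that the request spent time queued in front of a cold-starting instance rather than executing in your container.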

Upvotes: 0

Vali7394

Reputation: 499

This error can be caused by one of the following:

  1. A huge sudden increase in traffic.
  2. A long cold start time.
  3. A long request processing time
  4. A sudden increase in request processing time
  5. The service reaching its maximum container instance limit (HTTP 429)

We faced a similar issue sporadically; it was due to long request processing times when DB latency was high for a few requests.

Upvotes: 1

Learn2Code

Reputation: 2280

Setting the Max Retry Attempts to anything but zero will remedy this, as it did for me.

Upvotes: 0

Onkar

Reputation: 354

We also faced this issue when traffic suddenly increased during business hours. It is usually caused by a sudden traffic spike combined with a long instance start time while new instances spin up to accommodate the incoming requests. One way to handle this is to keep warmed-up instances always running, i.e. configure the --min-instances parameter in the Cloud Run deploy command. The other, recommended way is to reduce the service's cold start time (which is difficult to achieve in some languages, such as Java and Python).
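For example, keeping one warm instance looks roughly like this (the service and image names below are placeholders):

```shell
# Keep at least one warm instance around to absorb small spikes.
gcloud run deploy my-service \
  --image gcr.io/my-project/my-image \
  --platform managed \
  --min-instances 1
```

Note that instances kept warm this way are billed even while idle, so this trades cost for latency.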

Upvotes: 10

Charles Offenbacher

Reputation: 3132

I was able to resolve this on my service by raising the max autoscaling container count from 2 to 10. There really should be no reason that 2 would be anywhere close to too low for the traffic, but I suspect something in the Cloud Run internals was tying up the 2 containers somehow.
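For reference, the limit can be raised with a single command on an existing service (the service name here is a placeholder):

```shell
# Raise the autoscaling ceiling without redeploying the image.
gcloud run services update my-service --max-instances 10
```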

Upvotes: 2

guillaume blaquiere

Reputation: 75765

I also experienced the problem, and it's easy to reproduce. I have a Fibonacci container that computes fibo(45) in about 6 s. I use Hey to perform 200 requests, and I set my Cloud Run concurrency to 1.

Out of 200 requests I got 8 such errors. In my case: a sudden traffic spike and a long processing time. (Cold start is short for me, since the container is in Go.)
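The reproduction can be sketched roughly like this (the service URL and path are placeholders, and -c 50 is an assumed client-side concurrency; -n is Hey's total request count):

```shell
# Fire 200 requests at the service, 50 in flight at a time.
hey -n 200 -c 50 https://my-service-xyz.a.run.app/fibo/45
```

With Cloud Run concurrency set to 1, every in-flight request forces a separate instance, which makes the scale-up errors easy to trigger.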

Upvotes: 6

Steren

Reputation: 7909

This error message can appear when the infrastructure didn't scale fast enough to catch up with a traffic spike. The infrastructure only keeps a request in the queue for a certain amount of time (about 10 s), then aborts it.

This usually happens when:

  1. traffic suddenly increases sharply
  2. the cold start time is long
  3. the request time is long

Upvotes: 21
