Reputation: 31
We are investigating an issue on a deployed cloud run service, where requests made to the service occasionnaly fail with a StatusCodeError: 500
, while no log of said requests appear on cloud run.
Served requests usually produce two log lines detailing the request, route and exit code (POST 200 on https://service-name.a.run.app/route/...
)
projects/XXX/logs/run.googleapis.com/stdout
is produced by our application to log the serving of every requestprojects/XXX/logs/run.googleapis.com/requests
is automatically produced by cloud run on every requestWhen the incident occurs, none of those are logged. The client (running in a gke pod in the same project) has the only log of the failing requests, with the following message:
StatusCodeError: 500 - "\n<html><head>\n<meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">\n<title>500 Server Error</title>\n</head>\n<body text=#000000 bgcolor=#ffffff>\n<h1>Error: Server Error</h1>\n<h2>The server encountered an error and could not complete your request.<p>Please try again in 30 seconds.</h2>\n<h2></h2>\n</body></html>\n"
Rough timeline of the last incident:
[INFO] Handling signal: term
)The absence of logs during the incident makes it hard to investigate its cause, and at this stage we would be gratefull for any kind of lead.
Our service has another known issue, that may or may not be related. The service is designed to avoid multiple replicas, as a single one should be able to handle the load and serve concurrent requests (cloud run concurency = 80), but has a relatively long cold start time (~30s). This leads to 429 errors when a spike of requests comes while no replica is available (because of cloud run hard capping concurrency to 1 during cold start). This issue was somewhat mitigated by allowing some replication (currently maxScale = 3), since each replica can put a request on hold during the cold start, but will require some work on the client side to handle correctly (simple retries after the cold start).
Upvotes: 2
Views: 2460
Reputation: 1392
I have found this PIT that describes the aforementioned behavior. It seems to happen because a part of Cloud Run thinks that there are already provisioned instances handling the traffic but there aren't. This issue is currently being worked on internally but there's no ETA for a fix at the moment.
The current workaround is to set a maximum number of instances to at least 4.
Upvotes: 4