WilliamR

Reputation: 31

Cloud Run: 500 Server Error with no log output

We are investigating an issue on a deployed Cloud Run service where requests made to the service occasionally fail with a StatusCodeError: 500, while no log of those requests appears in Cloud Run.

Served requests usually produce two log lines detailing the request method, route and status code (e.g. POST 200 on https://service-name.a.run.app/route/...)

When the incident occurs, none of these lines are logged. The client (running in a GKE pod in the same project) holds the only record of the failing requests, with the following message:

StatusCodeError: 500 - "\n<html><head>\n<meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">\n<title>500 Server Error</title>\n</head>\n<body text=#000000 bgcolor=#ffffff>\n<h1>Error: Server Error</h1>\n<h2>The server encountered an error and could not complete your request.<p>Please try again in 30 seconds.</h2>\n<h2></h2>\n</body></html>\n"

Rough timeline of the last incident: [timeline not reproduced here]

The absence of logs during the incident makes it hard to investigate its cause, and at this stage we would be grateful for any kind of lead.


Our service has another known issue that may or may not be related. The service is designed to avoid multiple replicas, since a single one should be able to handle the load and serve concurrent requests (Cloud Run concurrency = 80), but it has a relatively long cold start time (~30 s). This leads to 429 errors when a spike of requests arrives while no replica is available, because Cloud Run hard-caps concurrency to 1 during a cold start. We somewhat mitigated this by allowing some replication (currently maxScale = 3), since each replica can put a request on hold during the cold start, but handling it correctly will still require some work on the client side (simple retries after the cold start), along the lines of the sketch below.
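For illustration, a minimal sketch of such a retry wrapper, assuming the client runs on Node 18+ (global fetch) and TypeScript; the service URL is a placeholder, and the retried statuses and delay are only illustrative (the delay loosely follows the error page's "try again in 30 seconds" hint):

    // Hypothetical endpoint; replace with the real Cloud Run service URL.
    const SERVICE_URL = "https://service-name.a.run.app/route";

    // POST with simple retries on 429/500, the statuses seen during cold
    // starts and during the incident described above.
    async function postWithRetry(
      body: unknown,
      maxAttempts = 3,
      delayMs = 30_000,
    ): Promise<Response> {
      let lastError: unknown;
      for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          const res = await fetch(SERVICE_URL, {
            method: "POST",
            headers: { "content-type": "application/json" },
            body: JSON.stringify(body),
          });
          // Anything other than 429/500 is returned to the caller as-is.
          if (res.status !== 429 && res.status !== 500) return res;
          lastError = new Error(`HTTP ${res.status}`);
        } catch (err) {
          lastError = err; // network-level failure, also worth retrying
        }
        if (attempt < maxAttempts) {
          await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
      }
      throw lastError;
    }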

Upvotes: 2

Views: 2460

Answers (1)

Ajordat

Reputation: 1392

I have found this issue in the Public Issue Tracker (PIT) that describes the behavior you mention. It seems to happen because a part of Cloud Run thinks that there are already provisioned instances handling the traffic when there aren't. This issue is currently being worked on internally, but there's no ETA for a fix at the moment.

The current workaround is to set the maximum number of instances to at least 4.
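If it helps, one way to apply that (assuming the gcloud CLI and SERVICE_NAME as a placeholder for your service) is to run gcloud run services update SERVICE_NAME --max-instances=4, or to set the same limit in the Cloud Run console.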

Upvotes: 4
