Chris

Reputation: 1660

Why isn't Google Cloud Run performing the startup probe after scaling?

I have written a Rust server that runs inside a Docker container on Google Cloud Run. The server receives infrequent requests and immediately responds with a 200 status code acknowledgement. It then runs an asynchronous background job and sends a callback request once it is done. The server queues background jobs and runs one at a time.

While the job runs, it requests its own /ping endpoint to keep the instance alive so that Cloud Run does not scale to 0 instances. Upon receiving a SIGINT, the server exits immediately. This workflow appears to be correct: I can see that the instance is kept alive while the background job is running.
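For reference, the keep-alive and shutdown logic follows roughly this pattern. This is a simplified sketch rather than the actual code, assuming the tokio (full features) and reqwest crates; in the real server the pinger only runs while a job is in progress.

use std::time::Duration;

async fn keep_alive() {
    // Ping our own public /ping endpoint periodically while a job is running,
    // so Cloud Run keeps seeing traffic and does not scale the instance to 0.
    let mut ticker = tokio::time::interval(Duration::from_secs(15));
    loop {
        ticker.tick().await;
        let _ = reqwest::get("https://my-service-tichlvfbva-nw.a.run.app/ping").await;
    }
}

#[tokio::main]
async fn main() {
    // ... start the monitor thread, worker thread and HTTP server here ...

    // Spawned unconditionally here only to keep the sketch short.
    let pinger = tokio::spawn(keep_alive());

    // Exit immediately on SIGINT, as described above.
    tokio::signal::ctrl_c().await.expect("failed to listen for SIGINT");
    println!("SIGINT received. Exiting.");
    pinger.abort();
}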

I have configured Cloud Run with --min-instances=0 and --max-instances=1; the full deploy command is included in the edit at the bottom of this question.

After deploying and testing this scale-up/scale-down process, everything seems to work correctly. However, if several hours pass with no activity and a new request then arrives, the instance does not seem to scale from 0 to 1 instances correctly. Normally, when the server starts, I see the following in the logs:

POST https://my-service-tichlvfbva-nw.a.run.app/run-background-job
[1694802575] Monitor thread started.
[1694802575] Worker thread started.
[1694802575] Server started at http://0.0.0.0:8080.
INFO 2023-09-15T18:29:35.936759Z Default STARTUP TCP probe succeeded after 1 attempt for container "my-service-1" on port 8080.
[1694802576] Running background job...
GET https://my-service-nw.a.run.app/ping
GET https://my-service-nw.a.run.app/ping
GET https://my-service-nw.a.run.app/ping
[1694802578] Sending callback HTTP request.
[1694802578] Job finished.
[1694803476] SIGINT received. Exiting.

However, after several hours of inactivity, the startup TCP check does not appear to run. The logs show:

POST https://my-service-tichlvfbva-nw.a.run.app/run-background-job
Container terminated on signal 4.

The POST request fails with 503 Service Unavailable, and all subsequent POSTs fail with the same error. I then have to re-deploy the service to get it working again, after which it runs for several hours until it becomes unavailable once more. I don't understand why I am getting a SIGILL (signal 4), and I don't understand why the startup probe isn't running.

The background job does run some AVX2 and AVX-512 instructions, but the program checks at runtime whether these are available on the target platform. The server doesn't seem to be getting that far anyway, since it never logs the 'Monitor thread started.' line, which is printed before any request processing. I'm very confused as to what is going wrong.
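The runtime check uses the standard library's feature detection, roughly like the sketch below (shown for AVX2; the AVX-512 path is analogous). This is not the actual job code; the process_* functions are placeholders.

fn run_job(data: &[u8]) {
    // is_x86_feature_detected! queries the CPU at runtime, so the SIMD path
    // is only taken on machines that actually report AVX2 support.
    if is_x86_feature_detected!("avx2") {
        unsafe { process_avx2(data) }
    } else {
        process_fallback(data)
    }
}

#[target_feature(enable = "avx2")]
unsafe fn process_avx2(data: &[u8]) {
    // The AVX2 implementation of the job would go here.
    process_fallback(data)
}

fn process_fallback(data: &[u8]) {
    // Plain scalar implementation of the job.
    let _ = data;
}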

I haven't tried the following steps yet because I'd like to understand what's going wrong first, but I thought I might try:

  1. Switching to 'First generation' instances (maybe they're more reliable?)
  2. Replacing the TCP probe with an HTTP probe (maybe it will always run, in that case?)
  3. Setting the minimum number of instances to 1 to prevent scaling down (this will cost more money)

Edit: As requested in the comments, here is the deploy command. I'm not using a service.yml.

gcloud run deploy my-service \
--image=europe-west2-docker.pkg.dev/my-service-398017/docker/my_service:${{ github.ref_name }} \
--region=europe-west2 \
--allow-unauthenticated \
--command=./server \
--args=--port,8080 \
--execution-environment=gen2 \
--min-instances=0 \
--max-instances=1 \
--cpu=8 \
--memory=4Gi \
--no-cpu-throttling

Upvotes: 2

Views: 6742

Answers (1)

Rohit Kharche

Reputation: 2919

While the job runs, Cloud Run checks that the service is responsive via the TCP/HTTP probe. But when the service is idle for 15 minutes, it automatically scales down to min-instances: 0. In your case, the TCP probe needs an active instance to run its checks against, so when the service scales back up the TCP probe does not happen.

The default timeout for scaling down to zero is 15 minutes, although this can be configured. Cloud Run may keep some instances idle for up to 15 minutes to minimize the impact of cold starts, and after that idle period the service scales down.

This means that when a new request arrives after several hours, Cloud Run needs to provision a new instance before it can serve the request. This can take a few seconds, which is why you are seeing a 503 Service Unavailable error.

Possible solutions:

  • Keep at least 1 minimum instance alive so that the probe checks can run successfully, e.g. by changing --min-instances=0 to --min-instances=1 in your deploy command. This will also increase your costs, as you will be charged for the running instance even when it is not serving any traffic.

  • Implement an HTTP probe: a TCP probe only checks whether the container is listening on the specified port, whereas an HTTP probe makes an actual HTTP request to your service to check that it is healthy, even if the service is not receiving any traffic.

The best option would be to use an HTTP probe, as it does not require a minimum instance: probe requests are routed through the load balancer, which can always probe your service's endpoint even if no instances are running.
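As far as I know, the startup probe is configured through the service YAML (applied with gcloud run services replace service.yaml) or in the Cloud console rather than as a flag on gcloud run deploy. A minimal sketch of the relevant part of the YAML, assuming your /ping endpoint on port 8080 (values are illustrative):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-service
spec:
  template:
    spec:
      containers:
        - image: europe-west2-docker.pkg.dev/my-service-398017/docker/my_service:TAG  # tag is illustrative
          ports:
            - containerPort: 8080
          startupProbe:
            httpGet:
              path: /ping
              port: 8080
            periodSeconds: 3
            timeoutSeconds: 1
            failureThreshold: 10

Since your server already exposes /ping for its keep-alive requests, it is a natural endpoint to probe; it should return a 2xx quickly so the probe succeeds as soon as the server is listening.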

Here are my answers to the questions you raised:

Switching to 'First generation' instances (maybe they're more reliable?)

No, I don't think that would be a good option, as gen2 has some advantages compared to gen1.

Replacing the TCP probe with an HTTP probe (maybe it will always run, in that case?)

Yes. As mentioned above, an HTTP probe should do the trick for you.

Setting the minimum number of instances to 1 to prevent scaling down (this will cost more money)?

Yes, it is more costly than your current setup. You can refer to the Cloud Run pricing documentation for more details.

Upvotes: 0
