alanmynah

Reputation: 221

Cloud Run Flask API container running shutit enters a sleep loop

The issue appeared recently: a previously healthy container now enters a sleep loop whenever a shutit session is created. It occurs only on Cloud Run, not locally.

Minimal reproducible example:

requirements.txt

Flask==2.0.1
gunicorn==20.1.0
shutit

Dockerfile

FROM python:3.9

# Allow statements and log messages to immediately appear in the Cloud Run logs
ENV PYTHONUNBUFFERED True

COPY requirements.txt ./
RUN pip install -r requirements.txt

# Copy local code to the container image.
ENV APP_HOME /myapp
WORKDIR $APP_HOME
COPY . ./

CMD exec gunicorn \
 --bind :$PORT \
 --worker-class "sync" \
 --workers 1 \
 --threads 1 \
 --timeout 0 \
 main:app

main.py

import os
import shutit
from flask import Flask, request

app = Flask(__name__)

# just to prove api works
@app.route('/ping', methods=['GET'])
def ping():
    os.system('echo pong')
    return 'OK'

# issue replication
@app.route('/healthcheck', methods=['GET'])
def healthcheck():
    os.system("echo 'healthcheck'")
    # hangs inside create_session
    shell = shutit.create_session(echo=True, loglevel='debug')
    # shell.send is never reached
    shell.send('echo Hello World', echo=True)
    # never returned
    return 'OK'

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=8080, debug=True)

cloudbuild.yaml

steps:
  - id: "build_container"
    name: "gcr.io/kaniko-project/executor:latest"
    args:
      - --destination=gcr.io/$PROJECT_ID/borked-service-debug:latest
      - --cache=true
      - --cache-ttl=99h
  - id: "configure infrastructure"
    name: "gcr.io/cloud-builders/gcloud"
    entrypoint: "bash"
    args:
      - "-c"
      - |
        set -euxo pipefail

        REGION="europe-west1"
        CLOUD_RUN_SERVICE="borked-service-debug"

        SA_NAME="$${CLOUD_RUN_SERVICE}@${PROJECT_ID}.iam.gserviceaccount.com"

        gcloud beta run deploy $${CLOUD_RUN_SERVICE} \
          --service-account "$${SA_NAME}" \
          --image gcr.io/${PROJECT_ID}/$${CLOUD_RUN_SERVICE}:latest \
          --allow-unauthenticated \
          --platform managed \
          --concurrency 1 \
          --max-instances 10 \
          --timeout 1000s \
          --cpu 1 \
          --memory=1Gi \
          --region "$${REGION}"

Cloud Run logs that repeat in a loop:

Setting up prompt
In session: host_child, trying to send: export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
================================================================================
Sending>>> export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'<<<, expecting>>>['\r\nORIGIN_ENV:rkkfQQ2y# ']<<<
Sending in pexpect session (68242035994000): export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
Expecting: ['\r\nORIGIN_ENV:rkkfQQ2y# ']
export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
root@localhost:/myapp# export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
Stopped sleep .05
Stopped sleep 1
pexpect: buffer: b'' before: b'cm9vdEBsb2NhbGhvc3Q6L3B1YnN1YiMgIGV4cx' after: b'DQpPUklHSU5fRU5WOnJra2ZRUTJ5IyA='
Resetting default expect to: ORIGIN_ENV:rkkfQQ2y# 
In session: host_child, trying to send: stty cols 65535
================================================================================
Sending>>> stty cols 65535<<<, expecting>>>ORIGIN_ENV:rkkfQQ2y# <<<
Sending in pexpect session (68242035994000): stty cols 65535
Expecting: ORIGIN_ENV:rkkfQQ2y# 
ORIGIN_ENV:rkkfQQ2y# stty cols 65535
stty cols 65535
Stopped stty cols 65535
Stopped sleep .05
Stopped sleep 1

Workarounds tried:

Upvotes: 6

Views: 1039

Answers (2)

Priyashree Bhadra

Reputation: 3597

I have reproduced your issue and considered several possibilities. I think the problem is that your Cloud Run service is unable to process requests and is therefore being prepared for shutdown (SIGTERM). Here are some possibilities for you to look at and analyse.

  • A likely reason for a Cloud Run service failing to start is that the server process inside the container is configured to listen on the localhost (127.0.0.1) address. This is the loopback network interface, which is not reachable from outside the container, so the Cloud Run health check cannot be performed and the service deployment fails. To solve this, configure your application to start the HTTP server on all network interfaces, commonly denoted as 0.0.0.0.

  • While searching for the Cloud Logging error you are getting, I came across this answer and GitHub link from the shutit library developer, which point to a technique for tracking inputs and outputs in complex container builds in shutit sessions. One useful finding from the GitHub link: I think you need to pass the session_type, as in shutit.create_session('bash') or shutit.create_session('docker'), which main.py does not specify. That may be why your shutit session is failing.

  • Also, this issue could be due to a Linux kernel feature used by the shutit library that is not yet properly supported in gVisor. I am not sure how it worked for you initially. Most apps work fine on gVisor, or at least as well as in regular Docker, but it does not provide 100% compatibility.

    Cloud Run applications run in the gVisor container sandbox (which currently supports Linux only), which executes the Linux kernel system calls made by your application in userspace. gVisor does not implement all system calls (see here). From this GitHub link: “If your app has such a system call (quite rare), it will not work on Cloud Run. Such an event is logged and you can use strace to determine when the system call was made in your app”

    If you're running your code on Linux, install strace with sudo apt-get install strace, then run your application under it by prefixing your usual invocation with strace -f, where -f also traces child processes and threads. For example, if you normally start your application with ./main, run /usr/bin/strace -f ./main instead.

    From this documentation: “If you feel your issue is caused by a limitation in the Container sandbox, in the Cloud Logging section of the GCP Console (not in the "Logs" tab of the Cloud Run section) you can look for Container Sandbox with a DEBUG severity in the varlog/system logs or use the Log Query:

resource.type="cloud_run_revision"
logName="projects/PROJECT_ID/logs/run.googleapis.com%2Fvarlog%2Fsystem"

For example: Container Sandbox: Unsupported syscall
setsockopt(0x3,0x1,0x6,0xc0000753d0,0x4,0x0)”
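The log query quoted above can also be assembled programmatically. This is only a sketch: the helper name sandbox_log_filter and the project ID "my-project" are hypothetical, and the extra textPayload clause assumes the sandbox messages arrive as plain text payloads, which I have not verified.

```python
def sandbox_log_filter(project_id: str) -> str:
    """Build a Cloud Logging filter for Cloud Run varlog/system entries."""
    log_name = f"projects/{project_id}/logs/run.googleapis.com%2Fvarlog%2Fsystem"
    clauses = [
        'resource.type="cloud_run_revision"',
        f'logName="{log_name}"',
        # Assumption: sandbox messages are text payloads containing this phrase.
        'textPayload:"Container Sandbox"',
    ]
    # Join the clauses with an explicit AND so the filter fits on one line.
    return " AND ".join(clauses)


# "my-project" is a placeholder project ID.
print(sandbox_log_filter("my-project"))
```

The printed string can be pasted into the Logs Explorer query box or passed as the filter argument to gcloud logging read.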

By default, min-instances is set to 0, so no container instances are kept warm. You can change this default using the Cloud Console, the gcloud command line, or a YAML file, by specifying a minimum number of container instances to keep warm and ready to serve requests.

For reference, you can also have a look at this documentation and GitHub link, which cover the Cloud Run container runtime behaviour and troubleshooting.
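Coming back to the first possibility (listening on 127.0.0.1), here is a minimal sketch of how the bind address could be derived. The helper name resolve_bind is hypothetical; the only thing it relies on is that Cloud Run passes the port to listen on through the PORT environment variable.

```python
import os


def resolve_bind(default_port: int = 8080) -> tuple[str, int]:
    """Return the (host, port) pair the HTTP server should listen on."""
    # Cloud Run injects the port through the PORT environment variable;
    # fall back to a default for local runs.
    port = int(os.environ.get("PORT", default_port))
    # 0.0.0.0 means "all network interfaces", so the server is reachable
    # from outside the container, unlike 127.0.0.1.
    return "0.0.0.0", port


host, port = resolve_bind()
print(f"listening on {host}:{port}")
```

In main.py this would replace the hard-coded values, e.g. app.run(host=host, port=port). As far as I can tell, the gunicorn command in your Dockerfile (--bind :$PORT) already binds to all interfaces, so this only matters if the app is started directly with python main.py.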

Upvotes: 2

Noam Yizraeli

Reputation: 5394

None of these is a perfect replacement for shutit, but you can use one of the following instead. I'm not sure what the big picture is, so I'll list several options.

For remote automation tasks from a Flask web server we use Paramiko for its simplicity and quick setup, though you might prefer something like pyinfra for large projects or subprocess for small local tasks.

  1. Paramiko - a bit more hands-on/manual than shutit; runs commands over the SSH protocol.

example:

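import paramiko

ip = 'server ip'
port = 22
# you can also use SSH keys instead of a password
username = 'username'
password = 'password'

cmd = 'some useful command'

ssh = paramiko.SSHClient()
# AutoAddPolicy trusts unknown host keys: fine for a demo, risky in production
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(ip, port, username, password)

try:
    # run the command and collect everything it wrote to stdout
    stdin, stdout, stderr = ssh.exec_command(cmd)
    resp = ''.join(stdout.readlines())
    print(resp)
finally:
    # always release the SSH connection
    ssh.close()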

more examples

  2. pyinfra - an Ansible-like library for automating tasks in an ad-hoc style

example to install a package using apt:

from pyinfra.operations import apt

apt.packages(
    name='Ensure iftop is installed',
    packages=['iftop'],
    sudo=True,
    update=True,
)

  3. subprocess - like Paramiko, not as extensive as shutit, but works like a charm
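
For the subprocess option, a minimal sketch of what the shutit call in the question's healthcheck could look like instead. The helper name run_command and the 10 second timeout are my own choices, not anything from the question.

```python
import subprocess


def run_command(args: list[str], timeout: float = 10.0) -> str:
    """Run a local command and return its stdout with trailing whitespace removed."""
    result = subprocess.run(
        args,
        capture_output=True,  # collect stdout and stderr instead of inheriting them
        text=True,            # decode bytes to str
        timeout=timeout,      # raise TimeoutExpired rather than hang forever
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()


print(run_command(["echo", "Hello World"]))
```

Passing the command as a list avoids going through a shell, and there is no pseudo-terminal or prompt matching involved, which is the part of shutit that loops in your logs.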

Upvotes: 0
