JeffLuppes

Reputation: 179

Models taking long to load on GCP app engine and workers restarting

The problem: an App Engine instance running a Flask API was stuck in an endless loop of worker restarts and was unresponsive the entire time, which prompted App Engine to scale up and add instances (up to 20!).

The Flask API served multiple machine learning models, which had to be loaded one by one. Loading one of these models apparently took so long that the worker was terminated. The logs essentially showed this:

    A 2020-03-20T14:42:23Z [2020-03-20 14:42:23 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:2952)
    A 2020-03-20T14:42:23Z [2020-03-20 14:42:23 +0000] [2952] [INFO] Worker exiting (pid: 2952)
    A 2020-03-20T14:42:24Z [2020-03-20 14:42:24 +0000] [2975] [INFO] Booting worker with pid: 2975

Changing these settings in app.yaml had no effect, as they operate at a higher level (App Engine's own health checks) rather than on the gunicorn workers:

    liveness_check:
      initial_delay_sec: 300
      check_interval_sec: 30
      timeout_sec: 4
      failure_threshold: 4
      success_threshold: 2
    readiness_check:
      check_interval_sec: 5
      timeout_sec: 4
      failure_threshold: 2
      success_threshold: 2
      app_start_timeout_sec: 300

Upvotes: 2

Views: 476

Answers (2)

Dustin Ingram

Reputation: 21520

You should set --timeout 0 for infinite timeouts.

The gunicorn arbiter gets confused when App Engine scales down instances and thinks workers have timed out.

App Engine has its own supervisor which oversees timeouts (with a much longer timeout period), so it's not necessary for Gunicorn to handle worker timeouts.

Upvotes: 1

JeffLuppes

Reputation: 179

After a quick Google search, it seemed much more likely that the timeouts came from the gunicorn workers themselves. I found docs that let me set the worker timeout in seconds.

Lo and behold: adding -t 75 to the entrypoint in my app.yaml fixed the problem. It turned out that one of the older models, a big Naive Bayes classifier, took around 50 s just to load, which is well over gunicorn's default worker timeout of 30 s.
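To pick a timeout like this, it can help to measure how long each model takes to load. The following is only a minimal sketch: `timed` and `load_model_stub` are hypothetical names, and the stub merely sleeps where a real model load would go.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def load_model_stub():
    # Hypothetical stand-in for a real model load (e.g. unpickling a classifier).
    time.sleep(0.05)
    return object()


load_times = {}
with timed("naive_bayes", load_times):
    model = load_model_stub()

# Print the slowest loads first; the gunicorn timeout should comfortably exceed them.
for name, seconds in sorted(load_times.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {seconds:.2f}s")
```

Wrapping each real load call this way and checking the log output shows which model is the slow one.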

My app.yaml:

    entrypoint: gunicorn -b :$PORT main:app -t 75
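The same settings can also live in a gunicorn configuration file, which is plain Python. This is a sketch, not what I actually deployed: the file name `gunicorn.conf.py` and the fallback port 8080 are assumptions, while `bind` and `timeout` are the gunicorn settings behind `-b` and `-t`.

```python
# gunicorn.conf.py (assumed file name; pass it to gunicorn with -c)
import os

# Equivalent of "-b :$PORT". App Engine sets PORT; 8080 is an assumed
# fallback for running locally.
bind = ":" + os.environ.get("PORT", "8080")

# Equivalent of "-t 75": seconds a worker may stay silent before the
# arbiter kills and restarts it. 0 disables the worker timeout entirely.
timeout = 75
```

The entrypoint would then become something like `gunicorn -c gunicorn.conf.py main:app`.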

I saw that other people running Flask APIs on App Engine had hit some variation of this problem, so I figured I'd leave this extra breadcrumb.

Upvotes: 0
