Michael Kim

Reputation: 771

gunicorn worker exits for every request

I have a fresh installation of apache-airflow 1.8.2. I started its webserver, and its gunicorn workers exit for every webpage request, leaving each request hanging for around 30s while waiting for a new worker to spawn. I need help fixing this issue.

Details below

I've installed apache-airflow 1.8.2 and followed this guide. I started the webserver on port 8081.

Now when I visit the server in my browser, the response is very slow. Looking at the console output, I noticed that every time I load a page, it prints "Worker exiting", then pauses for a long time before printing "Booting worker".

After digging into the source code I found out that these messages come from gunicorn workers. I have no experience with gunicorn, Airflow, or Flask, so I don't know if this is the expected behavior, but I feel like it shouldn't be. At the very least, a webserver should not hang for half a minute on every page load.

Console output:

---> Browser request
[2017-11-01 19:08:07 -0700] [14549] [INFO] Worker exiting (pid: 14549)
---> Hangs for 30s
[2017-11-01 19:08:37 -0700] [13316] [INFO] Handling signal: ttin
[2017-11-01 19:08:37 -0700] [14698] [INFO] Booting worker with pid: 14698
/Users/michael/Programs/clones/airflow/airflow/www/app.py:23: FlaskWTFDeprecationWarning: "flask_wtf.CsrfProtect" has been renamed to "CSRFProtect" and will be removed in 1.0.
  csrf = CsrfProtect()
/Users/michael/Programs/miaozhen/tests/airflow-test/lib/python3.6/site-packages/flask/exthook.py:71: ExtDeprecationWarning: Importing flask.ext.cache is deprecated, use flask_cache instead.
  .format(x=modname), ExtDeprecationWarning
127.0.0.1 - - [01/Nov/2017:19:08:37 -0700] "GET /admin/ HTTP/1.1" 200 95063 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
[2017-11-01 19:08:38,096] [14698] {models.py:168} INFO - Filling up the DagBag from /Users/michael/airflow/dags
---> other GET requests on the same webpage, skipped here for simplicity
[2017-11-01 19:08:39 -0700] [13316] [INFO] Handling signal: ttou

I'm currently running a source checkout of apache-airflow 1.8.2 (i.e. I cloned the source, checked out the tag, and installed it with pip install -e .) in a virtualenv. However, I've also tried running the PyPI version (pip install apache-airflow) without a virtualenv, and running the source version without a virtualenv. The same problem exists for all of these installations, so these differences appear to be irrelevant.

My Python installation is:

$ python -VV
Python 3.6.3 (default, Oct  4 2017, 06:09:38) 
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.37)]

EDIT:

I tried installing and running apache-airflow on another machine (Ubuntu 16.04 + Python 3.5), and there is no problem there. I also asked another person who is on a Mac with Python 3.6, and they see no problem either. I guess there's something weird about my machine... Any suggestions on how I can debug this?

Upvotes: 5

Views: 9909

Answers (3)

Gord1i

Reputation: 11

I ran into the same issue with Airflow 1.10.1, with the web UI being very unresponsive, and sure enough, calls to socket.getfqdn() were taking tens of seconds.

I'm running inside an older version of Docker, which is known to have trouble when a container tries to resolve its own address via DNS.

As a workaround, I changed the configuration value hostname_callable in airflow.cfg to socket:gethostname, and the web UI became responsive.

This does mean that the hostname used in all logs will be the local hostname (i.e. what you get when you run hostname).
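
For reference, this is roughly what that change looks like; a minimal sketch, assuming an Airflow 1.10.x layout where hostname_callable lives in the [core] section of airflow.cfg (check your own config for the exact section and default):

[core]
# Default is socket:getfqdn, which can block for a long time when reverse
# DNS for the machine's own address is slow or broken.
# socket:gethostname skips that lookup and just returns the local hostname.
hostname_callable = socket:gethostname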

Upvotes: 1

Michael Kim

Reputation: 771

Okay, after debugging into the source code I managed to isolate the problem; it turns out it has nothing to do with gunicorn.

The issue is this:

$ time python -c 'import socket; socket.getfqdn()'

real    0m30.051s
user    0m0.018s
sys     0m0.013s
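
In case it helps anyone else, here is a small diagnostic sketch (standard library only, nothing Airflow-specific) that separates the cheap local-hostname call from the fully-qualified-name lookup; the second call is the one that hangs on my machine:

import socket
import time

# gethostname() only reads the local hostname and should return instantly.
start = time.monotonic()
name = socket.gethostname()
print("gethostname() -> {!r} in {:.2f}s".format(name, time.monotonic() - start))

# getfqdn() additionally tries to resolve that name (gethostbyaddr under the
# hood), which is the part that can block on a broken DNS setup.
start = time.monotonic()
fqdn = socket.getfqdn()
print("getfqdn()     -> {!r} in {:.2f}s".format(fqdn, time.monotonic() - start))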

I think the question should be closed as it's no longer valid, but I have no idea how to do that.

Upvotes: 0

Daniel Huang

Reputation: 6538

Workers regularly exiting on the ttou signal (which means "decrement the number of processes by one") every so often is intentional. This is Airflow periodically "refreshing" its webserver workers. Based on what I read in AIRFLOW-276, which added this feature, workers are refreshed to ensure they pick up new or updated DAGs. This behavior can be modified in your Airflow config via worker_refresh_interval and worker_refresh_batch_size.

From looking at the source, the webserver spins up new workers before spinning down the old ones, so I don't think this should cause a delay for your requests. However, you can try disabling the refresh with worker_refresh_batch_size = 0.
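
For anyone looking for these options, they live in the [webserver] section of airflow.cfg; the values below are only an illustration of the knobs, not recommended settings:

[webserver]
# Seconds the gunicorn master waits between checks for refreshing workers.
worker_refresh_interval = 30
# Number of workers to refresh at a time; 0 disables the periodic refresh.
worker_refresh_batch_size = 0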

Upvotes: 5
