Reputation: 936
We created a Composer v2 environment to migrate from Google Cloud Composer v1.
All DAG code was adjusted, and we are using the newest image available to date, composer-2.0.0-preview.5-airflow-2.1.4.
We noticed that even though CPU load is low and memory is plentiful, the web server health is flaky (alternating red/green every couple of minutes on the environment monitoring page).
As a test, I removed the K8s health check on the webserver pod (and the startup probe as well). I then found that a call comes in from the IP of the airflow-monitoring pod (10.63.129.6), and shortly thereafter the gunicorn process receives a HUP:
airflow-webserver 10.63.129.6 - - [17/Nov/2021:12:56:03 +0000] "GET /_ah/health HTTP/1.1" 200 187 "-" "python-requests/2.24.0"
airflow-webserver [2021-11-17 12:56:05 +0000] [57] [INFO] Handling signal: hup
airflow-webserver [2021-11-17 12:56:05 +0000] [57] [INFO] Hang up: Master
airflow-webserver [2021-11-17 12:56:05 +0000] [1083] [INFO] Booting worker with pid: 1083
airflow-webserver [2021-11-17 12:56:05 +0000] [1084] [INFO] Booting worker with pid: 1084
airflow-webserver [2021-11-17 12:56:05 +0000] [1051] [INFO] Worker exiting (pid: 1051)
airflow-webserver [2021-11-17 12:56:05 +0000] [1052] [INFO] Worker exiting (pid: 1052)
airflow-webserver [2021-11-17 12:56:05 +0000] [1085] [INFO] Booting worker with pid: 1085
airflow-webserver [2021-11-17 12:56:05 +0000] [1086] [INFO] Booting worker with pid: 1086
airflow-webserver [2021-11-17 12:56:07 +0000] [57] [WARNING] Worker with pid 1052 was terminated due to signal 15
airflow-webserver [2021-11-17 12:56:07 +0000] [57] [WARNING] Worker with pid 1051 was terminated due to signal 15
This happens every minute, so the webserver responds slowly.
As the airflow-monitoring pod runs in a protected namespace in GKE Autopilot, I am not sure how to debug this further.
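One way to check whether every HUP really follows a health check is to pair up the two kinds of log lines by timestamp. The following is a minimal sketch, assuming the log formats shown in the excerpt above; the function name, the regular expressions, and the 5-second window are my own choices for illustration, and a close pairing only shows correlation, not the cause.

```python
import re
from datetime import datetime, timedelta

# Two lines copied from the log excerpt in the question.
SAMPLE_LOG = """\
airflow-webserver 10.63.129.6 - - [17/Nov/2021:12:56:03 +0000] "GET /_ah/health HTTP/1.1" 200 187 "-" "python-requests/2.24.0"
airflow-webserver [2021-11-17 12:56:05 +0000] [57] [INFO] Handling signal: hup
"""

# Access-log line for the health endpoint (timestamp like 17/Nov/2021:12:56:03 +0000).
HEALTH_RE = re.compile(
    r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\] "GET /_ah/health '
)
# gunicorn master line announcing a HUP (timestamp like 2021-11-17 12:56:05 +0000).
HUP_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4})\] \[\d+\] \[INFO\] Handling signal: hup"
)


def hups_after_health_checks(log_text, window=timedelta(seconds=5)):
    """Return (health_time, hup_time) pairs where a HUP follows a health check within `window`."""
    last_health = None
    pairs = []
    for line in log_text.splitlines():
        match = HEALTH_RE.search(line)
        if match:
            last_health = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
            continue
        match = HUP_RE.search(line)
        if match and last_health is not None:
            hup_time = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S %z")
            if timedelta(0) <= hup_time - last_health <= window:
                pairs.append((last_health, hup_time))
    return pairs


if __name__ == "__main__":
    for health_time, hup_time in hups_after_health_checks(SAMPLE_LOG):
        print(health_time.isoformat(), "->", hup_time.isoformat())
```

If some HUPs show up without a preceding health check, the health check is probably not what triggers the reload.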
Update:
It seems there are two things at play here. One looks like a race condition between the gcs-syncd pod and the webserver:
Removing file:///home/airflow/gcs/plugins/operators/__pycache__/trigger_emarsys_event_operator.cpython-38.pyc
{webserver_command.py:217} ERROR - [Errno 2] No such file or directory: '/home/airflow/gcs/plugins/operators/__pycache__/trigger_emarsys_event_operator.cpython-38.pyc'
{webserver_command.py:218} ERROR - Shutting down webserver
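The error above is consistent with a list-then-read race: a monitor lists the files in the plugins folder, the sync process deletes one of them, and the monitor then tries to open a path that no longer exists. The sketch below only illustrates that pattern and a tolerant variant; it is not Airflow's actual code, and `snapshot_plugin_state` and its `on_listed` hook are hypothetical names used to simulate the concurrent delete.

```python
import hashlib
import os
import tempfile


def snapshot_plugin_state(folder, on_listed=None):
    """Hash every file under `folder`, skipping files that vanish mid-scan.

    `on_listed` is a hypothetical hook called between listing and reading;
    it exists only so the race can be simulated deterministically.
    """
    paths = []
    for root, _, names in os.walk(folder):
        paths.extend(os.path.join(root, name) for name in names)

    if on_listed is not None:
        on_listed(paths)  # stand-in for a sync process deleting files

    state = {}
    for path in sorted(paths):
        try:
            with open(path, "rb") as handle:
                state[path] = hashlib.sha256(handle.read()).hexdigest()
        except FileNotFoundError:
            # The file was removed after it was listed; treat it as gone
            # instead of letting the error propagate.
            continue
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        keep = os.path.join(folder, "keep.py")
        stale = os.path.join(folder, "stale.pyc")
        for path in (keep, stale):
            with open(path, "w") as handle:
                handle.write("x = 1\n")

        # Delete one file after listing, as a concurrent sync might.
        state = snapshot_plugin_state(folder, on_listed=lambda _: os.remove(stale))
        print(sorted(os.path.basename(p) for p in state))
```

Without the `except FileNotFoundError` branch, the same run would raise `[Errno 2] No such file or directory`, which matches the shape of the logged error.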
Upvotes: 2
Views: 1362
Reputation: 936
Disabling the webserver reload on detected plugin changes resolved the situation. Each gcs-sync run was what triggered the restart. Thanks to @MateuszH for the tip!
[webserver]
reload_on_plugin_change=False
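Airflow also reads configuration options from environment variables named AIRFLOW__{SECTION}__{KEY}, which can help when the airflow.cfg file is not edited directly. As a small sketch, the helper below (a hypothetical name, not part of Airflow) derives that variable name from the fragment above using only the standard library:

```python
import configparser

# The configuration fragment from this answer.
FRAGMENT = """\
[webserver]
reload_on_plugin_change=False
"""


def airflow_env_overrides(cfg_text):
    """Map each option in an airflow.cfg-style fragment to its AIRFLOW__SECTION__KEY name."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {
        f"AIRFLOW__{section.upper()}__{key.upper()}": value
        for section in parser.sections()
        for key, value in parser.items(section)
    }


if __name__ == "__main__":
    for name, value in airflow_env_overrides(FRAGMENT).items():
        print(f"{name}={value}")
```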
Upvotes: 2
Reputation: 1428
If you have overridden the core:default_timezone Airflow configuration option, note that the environment health status is just a metric and has no impact on actual job or task execution.
You can ignore the health status, or you can remove the override to fall back to the default UTC timezone.
This is because Composer runs a liveness DAG named airflow_monitoring every 5 minutes and reports environment health as follows:
Upvotes: 0