Tobi

Reputation: 936

Google Cloud Composer v2 health-check seems to be false negative/flaky

We created a Composer v2 environment to migrate from Google Cloud Composer v1. All DAG code has been adjusted, and we are using the newest image available to date, composer-2.0.0-preview.5-airflow-2.1.4.

We noticed that even though CPU load is low and there is plenty of free memory, the web server health is flaky (alternating between red and green every couple of minutes on the environment monitoring page).

As a test, I removed the K8s health check on the webserver pod (and the startup probe as well). I then found that a request comes in from the IP of the airflow-monitoring pod (10.63.129.6), and shortly afterwards the gunicorn master process receives a HUP signal:

airflow-webserver 10.63.129.6 - - [17/Nov/2021:12:56:03 +0000] "GET /_ah/health HTTP/1.1" 200 187 "-" "python-requests/2.24.0"
airflow-webserver [2021-11-17 12:56:05 +0000] [57] [INFO] Handling signal: hup
airflow-webserver [2021-11-17 12:56:05 +0000] [57] [INFO] Hang up: Master
airflow-webserver [2021-11-17 12:56:05 +0000] [1083] [INFO] Booting worker with pid: 1083
airflow-webserver [2021-11-17 12:56:05 +0000] [1084] [INFO] Booting worker with pid: 1084
airflow-webserver [2021-11-17 12:56:05 +0000] [1051] [INFO] Worker exiting (pid: 1051)
airflow-webserver [2021-11-17 12:56:05 +0000] [1052] [INFO] Worker exiting (pid: 1052)
airflow-webserver [2021-11-17 12:56:05 +0000] [1085] [INFO] Booting worker with pid: 1085
airflow-webserver [2021-11-17 12:56:05 +0000] [1086] [INFO] Booting worker with pid: 1086
airflow-webserver [2021-11-17 12:56:07 +0000] [57] [WARNING] Worker with pid 1052 was terminated due to signal 15
airflow-webserver [2021-11-17 12:56:07 +0000] [57] [WARNING] Worker with pid 1051 was terminated due to signal 15

This happens every minute, so the webserver responds slowly. Because the airflow-monitoring pod runs in a protected namespace in GKE Autopilot, I am not sure how to debug this further.

Update: It seems there are two things at play here. One looks like a race condition between the gcs-syncd pod and the webserver:
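To check whether the HUP really follows the health-check request each time, the webserver log can be scanned for both events. The following is only a minimal sketch: the function name, the 5-second window, and the regular expressions are assumptions derived from the log lines quoted above, and a match shows temporal proximity, not causation.

```python
import re
from datetime import datetime, timedelta

# Patterns are assumptions based on the log format shown in this post.
HEALTH_RE = re.compile(
    r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) \+0000\] "GET /_ah/health'
)
HUP_RE = re.compile(
    r'\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \+0000\] \[\d+\] \[INFO\] Handling signal: hup'
)


def hups_after_health_checks(lines, window_seconds=5):
    """Return (health_check_time, hup_time) pairs where a HUP was logged
    within window_seconds after a health-check request."""
    checks, hups = [], []
    for line in lines:
        match = HEALTH_RE.search(line)
        if match:
            checks.append(datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S"))
            continue
        match = HUP_RE.search(line)
        if match:
            hups.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    window = timedelta(seconds=window_seconds)
    return [(c, h) for c in checks for h in hups if timedelta(0) <= h - c <= window]


# Two lines copied from the log excerpt above.
SAMPLE = [
    'airflow-webserver 10.63.129.6 - - [17/Nov/2021:12:56:03 +0000] '
    '"GET /_ah/health HTTP/1.1" 200 187 "-" "python-requests/2.24.0"',
    'airflow-webserver [2021-11-17 12:56:05 +0000] [57] [INFO] Handling signal: hup',
]

for check_time, hup_time in hups_after_health_checks(SAMPLE):
    print(f"health check at {check_time}, HUP {(hup_time - check_time).total_seconds():.0f}s later")
```

If the pairs show up only occasionally rather than after every health check, that would suggest the HUP has a different trigger.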

Removing file:///home/airflow/gcs/plugins/operators/__pycache__/trigger_emarsys_event_operator.cpython-38.pyc
{webserver_command.py:217} ERROR - [Errno 2] No such file or directory: '/home/airflow/gcs/plugins/operators/__pycache__/trigger_emarsys_event_operator.cpython-38.pyc'
{webserver_command.py:218} ERROR - Shutting down webserver

Upvotes: 2

Views: 1362

Answers (2)

Tobi

Reputation: 936

Disabling the webserver reload on detected plugin changes resolved the situation. Every gcs-sync run was triggering the restart. Thanks to @MateuszH for the tip!

[webserver]
reload_on_plugin_change=False

https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#reload-on-plugin-change
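For anyone who wants to double-check the setting, here is a small sketch that parses the fragment above and derives the equivalent environment variable name. The `AIRFLOW__{SECTION}__{KEY}` pattern is Airflow's standard convention; the helper name `airflow_env_var` is just illustrative. In Composer the value would presumably be applied as an Airflow configuration override rather than by editing airflow.cfg directly.

```python
import configparser

# The config fragment from this answer.
CFG = """
[webserver]
reload_on_plugin_change = False
"""


def airflow_env_var(section, key):
    """Build the environment variable name Airflow reads for a config option
    (hypothetical helper; the naming pattern itself is Airflow's convention)."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


parser = configparser.ConfigParser()
parser.read_string(CFG)

# getboolean understands "False"/"True" (case-insensitive).
reload_enabled = parser.getboolean("webserver", "reload_on_plugin_change")

print(airflow_env_var("webserver", "reload_on_plugin_change"), "=", reload_enabled)
```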

Upvotes: 2

Jose Gutierrez Paliza

Reputation: 1428

If you have set the core:default_timezone Airflow configuration option, note that the environment health status is just a metric and has no impact on actual job or task execution.

You can ignore the health status, or you can remove the configuration to fall back to the default UTC timezone.

This is because Composer runs a liveness DAG named airflow_monitoring every 5 minutes and reports environment health as follows:

  • When the DAG run finishes successfully the health status is True.
  • If the DAG run fails, the health status is False.
  • If the DAG does not finish, Composer polls the DAG’s status every 5 minutes and reports False if the one-hour timeout occurs.
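
The three rules above can be written down as a small decision function. This is only an illustration of the rule as described here, not Composer's actual implementation; the function name and the `None` return value for "still waiting" are assumptions, while the state strings follow Airflow's DAG run states.

```python
from datetime import timedelta

# One-hour timeout as described in the list above.
TIMEOUT = timedelta(hours=1)


def environment_healthy(dag_run_state, elapsed):
    """Map the state of the airflow_monitoring DAG run to a health value.

    Returns True or False once the outcome is known, or None while the
    run is still in progress and the timeout has not been reached.
    """
    if dag_run_state == "success":
        return True
    if dag_run_state == "failed":
        return False
    # Run has not finished: unhealthy only once the timeout has passed.
    if elapsed >= TIMEOUT:
        return False
    return None


print(environment_healthy("success", timedelta(minutes=1)))
print(environment_healthy("running", timedelta(minutes=65)))
```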

Upvotes: 0
