Reputation: 69
I recently built 120 DAGs in Cloud Composer. They all functioned for a while.
They were all approximately the same: each used a PythonOperator to make API calls to Google Search Console, collected 7-9k rows of GSC data into a pandas DataFrame, then uploaded it to a GCS bucket and to BigQuery (partitioned and clustered).
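For reference, each DAG boiled down to roughly the following (the site, bucket, and table names are made-up placeholders, and the actual Search Console call is stubbed out):

from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_gsc_rows(site, day):
    # Stub for the real Search Console searchanalytics.query call.
    return []

def extract_and_load(ds, **kwargs):
    # Pull one day of GSC data (~7-9k rows) into a DataFrame.
    df = pd.DataFrame(fetch_gsc_rows("https://www.example.com/", ds))
    # Stage the raw pull in a GCS bucket (pandas writes gs:// paths via gcsfs)...
    df.to_csv(f"gs://my-gsc-bucket/exports/{ds}.csv", index=False)
    # ...then append to the partitioned and clustered BigQuery table.
    df.to_gbq("my_dataset.gsc_performance", if_exists="append")

with DAG(
    dag_id="gsc_example_site",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)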
Occasionally I'd have all of them fail on the same day because the GSC auth token had been revoked, but that was no problem: create new credentials, upload them, and continue. That situation lasted a couple of months. Now nothing runs.
From the start, the Cloud Composer environment health showed occasional red spots, but now it is solid red every day.
I have found documentation on how to check the environment health, but not on how to find out why it is so poor or how to fix it.
Can anyone point me in the right direction?
Upvotes: 0
Views: 782
Reputation: 69
I was lucky enough to talk to someone from Google yesterday, who said that what I need to do is recreate my Cloud Composer environment because it has insufficient CPU. He suggested the flexible option when recreating it.
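I haven't finished the rebuild yet, but as I understand it the recreation boils down to something like this with gcloud (Composer 2; the name, location, and size below are placeholders, and --environment-size is presumably the knob for the CPU he was referring to):

gcloud composer environments create my-new-composer-env \
  --location us-central1 \
  --image-version composer-2-airflow-2 \
  --environment-size medium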
Upvotes: 0
Reputation: 2126
The environment health metric depends on a Composer-managed DAG named airflow_monitoring, which is triggered periodically by the airflow-monitoring pod. If this DAG hasn't been deleted, you can check the airflow-monitoring logs to see whether there are any problems related to reading the DAG's run statuses. You can also try troubleshooting the error in Cloud Logging using the filter:
resource.type="cloud_composer_environment"
severity=ERROR
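For example, you can pull recent errors from the command line with something like this (the project ID is a placeholder):

gcloud logging read \
  'resource.type="cloud_composer_environment" severity=ERROR' \
  --project=MY_PROJECT --freshness=1d --limit=20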
The liveness check failure could be due to the following known issue: if you have configured the core:default_timezone Airflow configuration override in your Composer environment, the environment health will be shown as unhealthy. The Composer product team is working on a resolution; in the meantime, removing the override (see the sketch below) should clear it.
Refer to this documentation for information on Cloud Composer's environment health metric.
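A rough example of removing that override with gcloud (the environment name and location are placeholders):

gcloud composer environments update MY_ENVIRONMENT \
  --location us-central1 \
  --remove-airflow-configs core-default_timezone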
Upvotes: 1