Sarah Messer

Reputation: 4023

How do I emit Airflow logs from a DataProcPySparkOperator

I've seen several bits on emitting logs from PythonOperator, and for configuring Airflow logs, but haven't found anything that will let me emit logs from within a containerized process, e.g. the DataProcPySparkOperator.

I've gone so far as to include the following at the top of the pyspark script that is run inside the Operator's cluster:

import logging
logging.info('Test bare logger')
for ls in ['airflow', 'airflow.task', __name__]:
    l = logging.getLogger(ls)
    l.info('Test {} logger'.format(ls))
print('Test print() logging')

It produces no output, although the Operator script otherwise runs as intended.

I assume I could build a connection to Cloud Storage (or the database) from within the cluster, perhaps piggybacking off the existing connection used to read and write files, but ... that seems like a lot of work for such a common need. What I'd really like are occasional status checks, such as record counts or other data at intermediate stages of the computation.
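For context, something like this rough sketch is what I have in mind for the "piggyback" workaround. It assumes the google-cloud-storage client library is available on the cluster, and the bucket name and object path are placeholders:

from google.cloud import storage

def report_status(stage, message, bucket_name='my-status-bucket'):
    # Uses the cluster's default service account credentials.
    client = storage.Client()
    blob = client.bucket(bucket_name).blob('job-status/{}.txt'.format(stage))
    blob.upload_from_string(message)

# e.g. report_status('after_join', 'records={}'.format(joined_df.count()))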

Does Airflow set up a Python logger in the cluster by default? If so, how do I access it?

Upvotes: 2

Views: 952

Answers (1)

Amine Sagaama

Reputation: 146

If you use Cloud Composer to run Airflow, be aware that Cloud Composer surfaces only Airflow logs and streaming logs.

When you run a PySpark job on a Dataproc cluster, the job driver output is stored in Cloud Storage (see "Accessing job driver output" in the Dataproc documentation).

You can also enable Dataproc to save job driver logs in Cloud Logging.

As described in the Dataproc documentation, to enable job driver logs in Logging, set the following cluster property when creating the cluster:

dataproc:dataproc.logging.stackdriver.job.driver.enable=true

The following cluster properties are also required, and are set by default when a cluster is created:

dataproc:dataproc.logging.stackdriver.enable=true
dataproc:jobs.file-backed-output.enable=true
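If you create the cluster from Airflow itself, one way to set these properties is through the properties argument of the cluster-create operator. This is only a sketch, assuming the Airflow 1.x contrib DataprocClusterCreateOperator inside your DAG definition; the project, zone, and cluster name are placeholders:

from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator

create_cluster = DataprocClusterCreateOperator(
    task_id='create_dataproc_cluster',
    cluster_name='pyspark-logging-cluster',   # placeholder
    project_id='my-project',                  # placeholder
    num_workers=2,
    zone='us-central1-a',
    # Send job driver logs to Cloud Logging; the last two properties
    # are normally set by default and are shown here for completeness.
    properties={
        'dataproc:dataproc.logging.stackdriver.job.driver.enable': 'true',
        'dataproc:dataproc.logging.stackdriver.enable': 'true',
        'dataproc:jobs.file-backed-output.enable': 'true',
    },
)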

Upvotes: 3
