Anand

Reputation: 21338

Application logging in Executor/Worker using Azure Databricks python notebooks

I am using Azure Databricks to build and run ETL pipelines. For development, I am using Databricks notebooks (Python). My goal is to view the application logs in the Spark UI for code running on both the driver and the executors.

Initially, I was unable to view executor logs, but as described here https://kb.databricks.com/clusters/set-executor-log-level.html, I am now able to view the application logs emitted from code running on the worker nodes (executors), e.g. inside foreach/foreachPartition.

The above link says that we need to set the log level on all executors. Does that mean we need to set the log level inside every method meant to run on the worker nodes, like below? Having to set the logging level in each method seems redundant and should be avoided.

 import logging

 def doSomething():
    # Runs on the executor, so logging is configured inside the function
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    ## some operation

 df.foreach(lambda x: doSomething())

To set the log level on all executors, you must set it inside the JVM on each worker.

Is there a better way to do this that avoids setting the log level every time?

Upvotes: 0

Views: 166

Answers (1)

As you mentioned, you want to view application logs for both the driver and the executors in the Spark UI.

To capture driver logs:

import logging
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Set the log level for the driver-side 'py4j' logger
log = logging.getLogger('py4j')
log.setLevel(logging.INFO)

def log_executor_operations(iterator):
    # Iterates over one RDD partition (runs on an executor)
    for x in iterator:
        yield x

rdd = sc.parallelize(range(5))
rdd.foreachPartition(log_executor_operations)

In the above, logging.getLogger('py4j') is used to set the log level for the driver process. This captures logs generated within the driver code and shows them in the Spark UI. log_executor_operations is defined to iterate over an RDD partition.
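For completeness, here is a minimal sketch of actually emitting a driver-side message through that logger; the handler setup and the message text are illustrative assumptions (on Databricks the driver output is typically already captured in the driver logs):

import logging

# Assumption: no handler is configured yet in this notebook, so basicConfig attaches one.
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

driver_log = logging.getLogger('py4j')
driver_log.info("Driver-side ETL step starting")  # appears in the driver log output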

To capture executor logs:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
sc.setLogLevel("DEBUG")

def log_executor_operations(iterator):
    # This function runs on the executors, so logging must be configured here
    import logging
    logging.basicConfig(level=logging.DEBUG,
                        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    executor_log = logging.getLogger("ExecutorLog")
    executor_log.setLevel(logging.DEBUG)
    executor_log.info("Executor has started processing.")

    for x in iterator:
        executor_log.debug(f"Processing value: {x}")

rdd = sc.parallelize(range(5))
rdd.foreachPartition(log_executor_operations)

Setting the log level with sc.setLogLevel("DEBUG") controls Spark's own logging, while the logger configured inside log_executor_operations produces the application messages that show up in each executor's logs.
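To avoid repeating the logging setup in every method (the concern raised in the question), one option is to run the configuration once per executor with a small one-off job. This is only a rough sketch: it assumes the partition count is large enough to reach every executor, and whether the configuration survives for later jobs depends on Python worker reuse (spark.python.worker.reuse) and on no new executors joining afterwards.

import logging
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

LOG_FORMAT = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'

def _configure_executor_logging(_):
    # Runs in an executor's Python worker; configures the root logger once
    logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)

# One-off job: use more partitions than executors so every executor runs it at least once.
sc.parallelize(range(1000), 100).foreach(_configure_executor_logging)

# Later worker-side functions can then call logging.getLogger(...) without reconfiguring.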

Results:

(screenshot of the log output in the Spark UI)

Upvotes: 0
