Reputation: 21
Hi, we're running a Spark driver program in PySpark in yarn-client mode (Spark version 3.2.1).
We have the following environment variables set:
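The exact values are specific to our environment; the variables relevant to this question are SPARK_HOME, SPARK_CONF_DIR, HADOOP_CONF_DIR and LOG_PATH. As a minimal sketch, they can be confirmed from the driver process like this:

import os

# Minimal sketch: print the environment variables that drive the logging setup.
# The variable names are the ones referenced in this question; values are site-specific.
for var in ("SPARK_HOME", "SPARK_CONF_DIR", "HADOOP_CONF_DIR", "LOG_PATH"):
    print(var, "=", os.environ.get(var))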
These are the contents of the log4j.properties file in the folder referred to by HADOOP_CONF_DIR:
log4j.rootLogger=${hadoop.root.logger}
hadoop.root.logger=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
These are the contents of the log4j2.properties file in the folder referred to by SPARK_CONF_DIR:
# Log files location
property.basePath = ${env:LOG_PATH}
# Set everything to be logged to the console
appender.rolling.type = RollingFile
appender.rolling.name = fileLogger
appender.rolling.fileName = ${basePath}/vdp-ingestion.log
appender.rolling.filePattern = ${basePath}/vdp-ingestion_%d{yyyyMMdd}.log.gz
# log in json-format -> based on LogstashJsonEventLayout
appender.rolling.layout.type = JsonTemplateLayout
appender.rolling.layout.eventTemplateUri = classpath:LogstashJsonEventLayoutV1.json
# overrule message -> by default treated as a string, however we want an object so we can use the native JSON format
# and use the underlying objects in kibana log filters
appender.rolling.layout.eventTemplateAdditionalField[0].type = EventTemplateAdditionalField
appender.rolling.layout.eventTemplateAdditionalField[0].key = message
appender.rolling.layout.eventTemplateAdditionalField[0].value = {"$resolver": "message", "fallbackKey": "message"}
appender.rolling.layout.eventTemplateAdditionalField[0].format = JSON
appender.rolling.layout.eventTemplateAdditionalField[1].type = EventTemplateAdditionalField
appender.rolling.layout.eventTemplateAdditionalField[1].key = pid
appender.rolling.layout.eventTemplateAdditionalField[1].value = {"$resolver": "pattern", "pattern": "%pid"}
appender.rolling.layout.eventTemplateAdditionalField[1].format = JSON
appender.rolling.layout.eventTemplateAdditionalField[2].type = EventTemplateAdditionalField
appender.rolling.layout.eventTemplateAdditionalField[2].key = tid
appender.rolling.layout.eventTemplateAdditionalField[2].value = {"$resolver": "pattern", "pattern": "%tid"}
appender.rolling.layout.eventTemplateAdditionalField[2].format = JSON
appender.rolling.policies.type = Policies
# RollingFileAppender rotation policy
appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
appender.rolling.policies.size.size = 10MB
appender.rolling.policies.time.type = TimeBasedTriggeringPolicy
appender.rolling.policies.time.interval = 1
appender.rolling.policies.time.modulate = true
appender.rolling.strategy.type = DefaultRolloverStrategy
appender.rolling.strategy.delete.type = Delete
appender.rolling.strategy.delete.basePath = ${basePath}
appender.rolling.strategy.delete.maxDepth = 10
appender.rolling.strategy.delete.ifLastModified.type = IfLastModified
# Delete all files older than 30 days
appender.rolling.strategy.delete.ifLastModified.age = 30d
rootLogger.level = INFO
rootLogger.appenderRef.rolling.ref = fileLogger
logger.spark.name = org.apache.spark
logger.spark.level = WARN
logger.spark.additivity = false
logger.spark.appenderRef.stdout.ref = fileLogger
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
logger.spark.repl.Main.level = WARN
logger.spark.repl.SparkIMain$exprTyper.level = INFO
logger.spark.repl.SparkILoop$SparkILoopInterpreter.level = INFO
# Settings to quiet third party logs that are too verbose
logger.jetty.name = org.sparkproject.jetty
logger.jetty.level = WARN
logger.jetty.util.component.AbstractLifeCycle.level = ERROR
logger.parquet.name = org.apache.parquet
logger.parquet.level = ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
logger.hadoop.name = org.apache.hadoop
logger.hadoop.level = WARN
logger.hadoop.hive.metastore.RetryingHMSHandler.level = FATAL
logger.hadoop.hive.ql.exec.FunctionRegistry.level = ERROR
logger.spark.sql.level = WARN
When we start up the PySpark program, it finds the log4j2.properties file and we can see that all non-root-level logs are captured in JSON for all dependencies.
However, for some reason the settings from log4j.properties are applied to the Spark driver logs, and all of these are reported to the console instead. If we change the level or format in the log4j.properties file, those settings are applied to the driver log output.
Is there any reason why Spark would use the Hadoop log4j.properties file instead of the log4j2.properties file? Are we missing some setting here?
We also tried to provide the log4j2.properties file via the driver's extra Java options in spark-defaults.conf:
spark.driver.extraJavaOptions -Djava.net.preferIPv4Stack=true -Djava.security.auth.login.config=conf/jaas_driver.conf -Djava.security.krb5.conf=conf/krb5_driver.conf -Dsun.security.krb5.debug=false -Dlog4j.configurationFile=file:/spark_conf_dir/log4j2.properties
where /spark_conf_dir is the folder referred to by SPARK_CONF_DIR.
But this didn't work either. For some reason the system always applies the log4j.properties settings to the driver program; it seems to overrule the settings in the log4j2.properties file with those in the log4j.properties file.
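To check whether the -D option actually reaches the driver JVM, and which log4j configuration files its classloader can see, a rough diagnostic like the following can be run from the PySpark session (it relies on the internal py4j gateway sparkContext._jvm, so it is only a debugging sketch, not a supported API):

from pyspark.sql import SparkSession

# Rough diagnostic sketch: inspect the driver JVM through the internal py4j gateway.
spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._jvm

# Was -Dlog4j.configurationFile actually set on the driver JVM?
print("log4j.configurationFile =", jvm.System.getProperty("log4j.configurationFile"))

# Which log4j config files does the driver's classloader resolve?
cl = jvm.Thread.currentThread().getContextClassLoader()
for name in ("log4j.properties", "log4j2.properties"):
    url = cl.getResource(name)
    print(name, "->", None if url is None else url.toString())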
This is on a virtual machine. If we remove the log4j.properties file from the HADOOP_CONF_DIR folder, nothing gets reported for the driver program at all (maybe the default errors would still appear, but currently nothing shows up).
If we instead build a Docker image with the same program from a base Python image with PySpark, we don't have this issue: the log output of the driver program and the dependent Spark packages is delivered to the log file in JSON format.
I would expect that providing -Dlog4j.configurationFile=file:/spark_conf_dir/log4j2.properties in spark.driver.extraJavaOptions would solve the issue, or that SPARK_CONF_DIR would take precedence over HADOOP_CONF_DIR for the log4j configuration.
Upvotes: 1
Views: 473
Reputation: 21
I finally found the root cause of the above issue and am documenting it here so that it hopefully helps anyone who runs into the same kind of problem.
The root cause of the interfering log4j 1 and log4j 2 configurations was a jar file that was present in the directory referred to by SPARK_HOME in one environment but not in the other.
This jar file has log4j 1 logging embedded, so it looked for a log4j.properties file on the classpath, found the one in HADOOP_CONF_DIR, and applied those settings to the root logger of the Spark driver program.
After removing this jar, all logging occurred through log4j2 with the log4j2.properties file available in SPARK_CONF_DIR.
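For anyone debugging a similar mix-up, a quick way to spot jars that bundle log4j 1.x classes is to scan the jar files under SPARK_HOME, for example with a small hypothetical helper like the one below (note that the log4j-1.2-api bridge jar also contains org/apache/log4j classes, so judge by the jar name as well before removing anything):

import glob
import os
import zipfile

# Hypothetical helper: scan jars under SPARK_HOME for bundled log4j 1.x classes.
spark_home = os.environ.get("SPARK_HOME", ".")
for jar in sorted(glob.glob(os.path.join(spark_home, "**", "*.jar"), recursive=True)):
    try:
        with zipfile.ZipFile(jar) as zf:
            if "org/apache/log4j/Logger.class" in zf.namelist():
                print("log4j 1.x classes found in:", jar)
    except zipfile.BadZipFile:
        pass  # not a readable jar/zip archive, skip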
Upvotes: 0