Reputation: 21
Hi, we're running a Spark driver program in PySpark in yarn-client mode (Spark version 3.2.1).
We have the following environment variables set:
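The exact values are specific to our environment; the variables relevant to this question are SPARK_HOME, SPARK_CONF_DIR, HADOOP_CONF_DIR and LOG_PATH. As a minimal sketch, they can be confirmed from the driver process like this:

import os

# Minimal sketch: print the environment variables that drive the logging setup.
# The variable names are the ones referenced in this question; values are site-specific.
for var in ("SPARK_HOME", "SPARK_CONF_DIR", "HADOOP_CONF_DIR", "LOG_PATH"):
    print(var, "=", os.environ.get(var))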
These are the contents of the log4j.properties file in the folder referred to by HADOOP_CONF_DIR:
log4j.rootLogger=${hadoop.root.logger}
hadoop.root.logger=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
These are the contents of the log4j2.properties file in the folder referred to by SPARK_CONF_DIR:
# Log files location
property.basePath = ${env:LOG_PATH}
# Set everything to be logged to the console
appender.rolling.type = RollingFile
appender.rolling.name = fileLogger
appender.rolling.fileName = ${basePath}/vdp-ingestion.log
appender.rolling.filePattern = ${basePath}/vdp-ingestion_%d{yyyyMMdd}.log.gz
# log in json-format -> based on LogstashJsonEventLayout
appender.rolling.layout.type = JsonTemplateLayout
appender.rolling.layout.eventTemplateUri = classpath:LogstashJsonEventLayoutV1.json
# overrule message -> by default treated as a string, however we want an object so we can use the native JSON format
# and use the underlying objects in kibana log filters
appender.rolling.layout.eventTemplateAdditionalField[0].type = EventTemplateAdditionalField
appender.rolling.layout.eventTemplateAdditionalField[0].key = message
appender.rolling.layout.eventTemplateAdditionalField[0].value = {"$resolver": "message", "fallbackKey": "message"}
appender.rolling.layout.eventTemplateAdditionalField[0].format = JSON
appender.rolling.layout.eventTemplateAdditionalField[1].type = EventTemplateAdditionalField
appender.rolling.layout.eventTemplateAdditionalField[1].key = pid
appender.rolling.layout.eventTemplateAdditionalField[1].value = {"$resolver": "pattern", "pattern": "%pid"}
appender.rolling.layout.eventTemplateAdditionalField[1].format = JSON
appender.rolling.layout.eventTemplateAdditionalField[2].type = EventTemplateAdditionalField
appender.rolling.layout.eventTemplateAdditionalField[2].key = tid
appender.rolling.layout.eventTemplateAdditionalField[2].value = {"$resolver": "pattern", "pattern": "%tid"}
appender.rolling.layout.eventTemplateAdditionalField[2].format = JSON
appender.rolling.policies.type = Policies
# RollingFileAppender rotation policy
appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
appender.rolling.policies.size.size = 10MB
appender.rolling.policies.time.type = TimeBasedTriggeringPolicy
appender.rolling.policies.time.interval = 1
appender.rolling.policies.time.modulate = true
appender.rolling.strategy.type = DefaultRolloverStrategy
appender.rolling.strategy.delete.type = Delete
appender.rolling.strategy.delete.basePath = ${basePath}
appender.rolling.strategy.delete.maxDepth = 10
appender.rolling.strategy.delete.ifLastModified.type = IfLastModified
# Delete all files older than 30 days
appender.rolling.strategy.delete.ifLastModified.age = 30d
rootLogger.level = INFO
rootLogger.appenderRef.rolling.ref = fileLogger
logger.spark.name = org.apache.spark
logger.spark.level = WARN
logger.spark.additivity = false
logger.spark.appenderRef.stdout.ref = fileLogger
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
logger.spark.repl.Main.level = WARN
logger.spark.repl.SparkIMain$exprTyper.level = INFO
logger.spark.repl.SparkILoop$SparkILoopInterpreter.level = INFO
# Settings to quiet third party logs that are too verbose
logger.jetty.name = org.sparkproject.jetty
logger.jetty.level = WARN
logger.jetty.util.component.AbstractLifeCycle.level = ERROR
logger.parquet.name = org.apache.parquet
logger.parquet.level = ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
logger.hadoop.name = org.apache.hadoop
logger.hadoop.level = WARN
logger.hadoop.hive.metastore.RetryingHMSHandler.level = FATAL
logger.hadoop.hive.ql.exec.FunctionRegistry.level = ERROR
logger.spark.sql.level = WARN
When we start up the PySpark program, it finds the log4j2.properties file and we can see that all non-root-level logs are captured in JSON for all dependencies.
However, for some reason the settings from log4j.properties are applied to the Spark driver logs, and all of these are reported to the console instead. If we change the level or format in the log4j.properties file, those settings are applied to the driver log output.
Is there any reason why Spark would use the Hadoop log4j.properties file instead of the log4j2.properties file? Are we missing some setting here?
We also tried to provide the log4j2.properties file via the driver's extra Java options in spark-defaults.conf:
spark.driver.extraJavaOptions -Djava.net.preferIPv4Stack=true -Djava.security.auth.login.config=conf/jaas_driver.conf -Djava.security.krb5.conf=conf/krb5_driver.conf -Dsun.security.krb5.debug=false -Dlog4j.configurationFile=file:/spark_conf_dir/log4j2.properties
where /spark_conf_dir is the folder referred to by SPARK_CONF_DIR.
But this didn't work either. For some reason the system always applies the log4j.properties settings to the driver program; it seems to overrule the settings in the log4j2.properties file with those in the log4j.properties file.
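To check whether the -D option actually reaches the driver JVM, and which log4j configuration files its classloader can see, a rough diagnostic like the following can be run from the PySpark session (it relies on the internal py4j gateway sparkContext._jvm, so it is only a debugging sketch, not a supported API):

from pyspark.sql import SparkSession

# Rough diagnostic sketch: inspect the driver JVM through the internal py4j gateway.
spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._jvm

# Was -Dlog4j.configurationFile actually set on the driver JVM?
print("log4j.configurationFile =", jvm.System.getProperty("log4j.configurationFile"))

# Which log4j config files does the driver's classloader resolve?
cl = jvm.Thread.currentThread().getContextClassLoader()
for name in ("log4j.properties", "log4j2.properties"):
    url = cl.getResource(name)
    print(name, "->", None if url is None else url.toString())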
This is on a virtual machine. If we remove the log4j.properties file from the HADOOP_CONF_DIR folder, nothing gets reported for the driver program at all (maybe the default errors would still appear, but currently nothing shows up).
If we instead build a Docker image with the same program from a base Python image with PySpark, we don't have this issue: the log output of the driver program and the dependent Spark packages is delivered to the log file in JSON format.
I would expect that providing -Dlog4j.configurationFile=file:/spark_conf_dir/log4j2.properties in spark.driver.extraJavaOptions would solve the issue, or that SPARK_CONF_DIR would take precedence over HADOOP_CONF_DIR for the log4j configuration.
Upvotes: 1
Views: 473
Reputation: 21
I finally found the root cause of the above issue and am documenting it here so that it hopefully helps anyone who runs into the same kind of problem.
The root cause of the interfering log4j 1 and log4j 2 configurations was a jar file that was present in the directory referred to by SPARK_HOME in one environment but not in the other.
This jar file has log4j 1 logging embedded, so it looked for a log4j.properties file on the classpath, found the one in HADOOP_CONF_DIR, and applied those settings to the root logger of the Spark driver program.
After removing this jar, all logging occurred through log4j2 with the log4j2.properties file available in SPARK_CONF_DIR.
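For anyone debugging a similar mix-up, a quick way to spot jars that bundle log4j 1.x classes is to scan the jar files under SPARK_HOME, for example with a small hypothetical helper like the one below (note that the log4j-1.2-api bridge jar also contains org/apache/log4j classes, so judge by the jar name as well before removing anything):

import glob
import os
import zipfile

# Hypothetical helper: scan jars under SPARK_HOME for bundled log4j 1.x classes.
spark_home = os.environ.get("SPARK_HOME", ".")
for jar in sorted(glob.glob(os.path.join(spark_home, "**", "*.jar"), recursive=True)):
    try:
        with zipfile.ZipFile(jar) as zf:
            if "org/apache/log4j/Logger.class" in zf.namelist():
                print("log4j 1.x classes found in:", jar)
    except zipfile.BadZipFile:
        pass  # not a readable jar/zip archive, skip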
Upvotes: 0