Jaysheel Kalgal

Reputation: 46

Dataproc: update log level in Spark shell

I use the Jupyter terminal to access the driver of a Dataproc cluster. This is my gateway to the cluster, and direct SSH to the driver machine is not enabled. When I launch spark-shell, I keep getting INFO, DEBUG, and ContextCleaner messages throughout my session, and they disrupt my coding. Is there a way to turn these off?

scala> 22/10/11 15:47:31 INFO org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.22.86.219:43504) with ID 2
22/10/11 15:47:31 INFO org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.22.86.217:54770) with ID 1
22/10/11 15:47:31 INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Registering block manager cluster:39607 with 5.6 GB RAM, BlockManagerId(2, cluster, 39607, None)
22/10/11 15:47:31 INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Registering block manager cluster.internal:36731 with 5.6 GB RAM, BlockManagerId(1, cluster, 36731, None)
22/10/11 15:47:31 WARN com.google.cloud.hadoop.fs.gcs.GoogleHadoopSyncableOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://bucket/application_1665502930299_0001.lz4.inprogress
22/10/11 15:47:31 WARN com.google.cloud.hadoop.fs.gcs.GoogleHadoopSyncableOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://bucket/application_1665502930299_0001.lz4.inprogress
22/10/11 15:47:31 WARN com.google.cloud.hadoop.fs.gcs.GoogleHadoopSyncableOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://bucket/application_1665502930299_0001.lz4.inprogress
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 56
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 31
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 63
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 30
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 44
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 32
22/10/11 15:54:51 INFO org.apache.spark.ContextCleaner: Cleaned accumulator 35
22/10/11 15:54:53 INFO org.apache.spark.storage.memory.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 23.1 KB, free 3.8 GB)
22/10/11 15:54:53 INFO org.apache.spark.storage.memory.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 7.6 KB, free 3.8 GB)
22/10/11 15:54:53 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on clusterurl:33625 (size: 7.6 KB, free: 3.8 GB)
22/10/11 15:54:53 INFO org.apache.spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1184
22/10/11 15:54:53 INFO org.apache.spark.scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[9] at show at <console>:39) (first 15 tasks are for partitions Vector(1))
22/10/11 15:54:53 INFO org.apache.spark.scheduler.cluster.YarnScheduler: Adding task set 4.0 with 1 tasks
22/10/11 15:54:53 INFO org.apache.spark.scheduler.FairSchedulableBuilder: Added task set TaskSet_4.0 tasks to pool default
22/10/11 15:54:53 INFO org.apache.spark.scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 7, cluster.internal, executor 1, partition 1, PROCESS_LOCAL, 7908 bytes)
22/10/11 15:54:53 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on cluster.internal:36731 (size: 7.6 KB, free: 5.6 GB)
22/10/11 15:54:54 INFO org.apache.spark.scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 7) in 558 ms on cluster.internal (executor 1) (1/1)
22/10/11 15:54:54 INFO org.apache.spark.scheduler.cluster.YarnScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool default
22/10/11 15:54:54 INFO org.apache.spark.scheduler.DAGScheduler: ResultStage 4 (show at <console>:39) finished in 0.571 s
22/10/11 15:54:54 INFO org.apache.spark.scheduler.DAGScheduler: Job 4 finished: show at <console>:39, took 0.575517 s

Upvotes: 2

Views: 752

Answers (1)

Dagang Wei

Reputation: 26498

The logs are controlled by /etc/spark/conf/log4j.properties. The default root log level is INFO, but in spark-shell the root level is overridden to WARN. My guess is that you see logs like INFO org.apache.spark.scheduler.DAGScheduler because the file on your cluster contains a setting like log4j.logger.org.apache.spark=INFO.

There are several ways to change the log settings for spark-shell:

Session level

  1. Run sc.setLogLevel("WARN") in spark-shell, which updates the root log level for the whole process. It has the same effect as:
scala> import org.apache.log4j.{Level, Logger}
scala> Logger.getRootLogger().setLevel(Level.WARN)
  2. Get a specific logger and set its level, e.g.:
scala> import org.apache.log4j.{Level, Logger}
scala> Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
  3. Copy /etc/spark/conf/log4j.properties to /tmp/spark-log4j.properties, edit the copy with the desired log settings, then run:
spark-shell --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///tmp/spark-log4j.properties
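
As a rough sketch of what the edited copy might contain, the lines below raise the level for the packages that appear in the question's output. This assumes the cluster uses Log4j 1.x properties syntax, as the -Dlog4j.configuration option implies. The logger names come from the log lines in the question; the chosen levels are only an illustration. Everything else in the copied file, such as the appender definitions, would stay as it is.

```properties
# Illustrative additions to /tmp/spark-log4j.properties (Log4j 1.x syntax).
# Keep the rest of the copied file unchanged.

# Silence INFO output from Spark internals (DAGScheduler, ContextCleaner, etc.).
log4j.logger.org.apache.spark=WARN

# Hide the repeated hflush() rate-limit warnings from the GCS connector.
log4j.logger.com.google.cloud.hadoop.fs.gcs.GoogleHadoopSyncableOutputStream=ERROR
```

If the copied file already sets a more specific logger, for example a line for a sub-package of org.apache.spark, that line takes precedence over the parent setting and would need to be edited as well.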

Cluster level

  1. Edit /etc/spark/conf/log4j.properties and set higher log levels for the spammy packages, then run spark-shell.

  2. When creating the cluster, add --properties ^#^spark-log4j:log4j.logger.org.apache.spark=WARN#..., which will update the config file under the hood.

Upvotes: 1
