blue-sky

Reputation: 53806

Understanding Spark monitoring UI

For a running Spark job, here is part of the UI detail page at http://localhost:4040/stages/stage/?id=1&attempt=0:

[screenshot of the Spark UI stage detail page showing four tasks with Input, Write Time and Shuffle Write columns]

The documentation at http://spark.apache.org/docs/1.2.0/monitoring.html does not explain each of these parameters. What do the columns "Input", "Write Time" and "Shuffle Write" indicate?

As can be seen from this screenshot, these 4 tasks have been running for 1.3 minutes, and I'm trying to determine whether there is a bottleneck and, if so, where it is occurring.

Spark is configured to use 4 cores; I think this is why there are 4 tasks displayed in the UI. Is each task running on a single core?

What determines the "Shuffle Write" sizes?

My console output contains many log messages like these:

15/02/11 20:55:33 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:103306+103306
15/02/11 20:55:33 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:0+103306
15/02/11 20:55:33 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:0+103306
15/02/11 20:55:33 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:103306+103306
15/02/11 20:55:33 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:103306+103306
15/02/11 20:55:33 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:0+103306
15/02/11 20:55:33 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:0+103306
15/02/11 20:55:34 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:103306+103306
15/02/11 20:55:34 INFO rdd.HadoopRDD: Input split: file:/c:/data/example.txt:103306+103306
.....................

Are these the result of the file being split into multiple smaller chunks, with each "Input" of size 100.9 KB (shown in the Spark UI screenshot) mapping to one of these splits?
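For reference, here is a minimal sketch of the kind of job that produces input splits like the ones above (this is not my exact code; the word-count reduceByKey step is only a stand-in for the transformation that triggers the shuffle):

import org.apache.spark.{SparkConf, SparkContext}

object SplitExample {
  def main(args: Array[String]): Unit = {
    // local[4] starts a single in-process backend with 4 task slots
    val conf = new SparkConf().setAppName("SplitExample").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // textFile is backed by a HadoopRDD; the optional second argument is the
    // minimum number of partitions, which influences how the file is split
    val lines = sc.textFile("file:/c:/data/example.txt", 2)

    // an assumed shuffle step (word count); any reduceByKey/groupByKey/join
    // forces the map-side output to be written out as "Shuffle Write"
    val counts = lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    counts.count()
    sc.stop()
  }
}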

Upvotes: 5

Views: 5798

Answers (2)

user3648294

Reputation:

Input is the size of the data that your Spark job is ingesting, for example the data that each map task you have defined is reading.

Shuffle write is the number of bytes written to disk for use by future tasks, i.e. the data Spark writes to disk so that your map output can be transmitted. For example, if you run a join and the data needs to be shuffled to other nodes, this is the data that will be transferred to those nodes.
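For illustration, a join like the one below forces a shuffle (this is a minimal, made-up example, not from the question; it assumes an existing SparkContext named sc, and the keys and values are purely illustrative):

// joining requires all records with the same key to end up in the same task,
// so the map-side tasks first write their partitioned output to local disk
// ("Shuffle Write"); the tasks of the next stage then fetch it ("Shuffle Read")
val orders = sc.parallelize(Seq((1, "order-a"), (2, "order-b"), (3, "order-c")))
val payments = sc.parallelize(Seq((1, 42.0), (3, 7.5)))

val joined = orders.join(payments)   // RDD[(Int, (String, Double))]
joined.collect().foreach(println)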

Tasks don't run directly on cores; tasks run on executors, and each executor in turn uses the cores.
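As a sketch of how that mapping is usually configured on a cluster (the master URL and the values below are placeholders, not the asker's local[4] setup):

import org.apache.spark.{SparkConf, SparkContext}

// hypothetical standalone cluster; master URL and sizes are placeholders
val conf = new SparkConf()
  .setAppName("executor-cores-example")
  .setMaster("spark://master:7077")
  .set("spark.executor.memory", "2g")
  .set("spark.executor.cores", "4")  // each executor JVM runs up to 4 tasks at once

// tasks are scheduled onto executors; an executor uses its cores to run
// several tasks concurrently, so a task occupies a core only indirectly
val sc = new SparkContext(conf)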

Please also go through this link for a better understanding of the same.

Upvotes: 8

Sietse

Reputation: 201

Not everything gets printed in the logs, and especially not your custom code (unless you log it yourself). When something runs for too long, you may want to take a thread dump on one of the executors and look at the stacks to see how far your computation has progressed.

Upvotes: 0
