Reputation: 3374
I need to run a Spark program that processes a huge amount of data. I am trying to optimize it by working through the Spark UI and reducing the shuffle part.
There are a couple of components mentioned there: shuffle read and shuffle write. I can guess the difference based on the terminology, but I would like to understand what exactly they mean, and which of the two, shuffle read or shuffle write, hurts performance more.
I have searched the internet but could not find solid, in-depth details about them, so I wanted to see if anyone can explain them here.
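For context, here is a minimal sketch (assuming Spark 2.x+ and hypothetical HDFS paths) of the kind of job I mean; the join forces a shuffle, which is what the UI reports as shuffle write for the stages producing the join inputs and shuffle read for the stage doing the join:

```scala
import org.apache.spark.sql.SparkSession

object ShuffleExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shuffle-example").getOrCreate()
    val sc = spark.sparkContext

    // (customerId, orderId) pairs from a hypothetical orders file
    val orders = sc.textFile("hdfs:///data/orders").map { line =>
      val f = line.split(",")
      (f(0), f(1))
    }

    // (customerId, name) pairs from a hypothetical customers file
    val customers = sc.textFile("hdfs:///data/customers").map { line =>
      val f = line.split(",")
      (f(0), f(1))
    }

    // join() repartitions both RDDs by key: the upstream stages shuffle-write
    // their output, and the join stage shuffle-reads it over the network.
    val joined = orders.join(customers)
    joined.saveAsTextFile("hdfs:///data/orders_with_names")

    spark.stop()
  }
}
```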
Upvotes: 13
Views: 20262
Reputation: 620
I've recently begun working with Spark. I have been looking for answers to the same sort of questions.
When data from one stage is shuffled to the next stage across the network, the executor(s) processing the next stage pull the data from the first stage's processes over TCP. I noticed that the shuffle "write" and "read" metrics for each stage are displayed in the Spark UI for a particular job. A stage may also have an "input" size (e.g. input from HDFS or a Hive table scan).
I also noticed that the shuffle write size of one stage that fed into another stage did not always match that stage's shuffle read size. If I remember correctly, there are reducer-type operations (e.g. a map-side combine) that can be applied to the shuffle data before it is transferred to the next stage/executor as an optimization. Maybe this contributes to the difference in size, and hence the relevance of reporting both values.
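To illustrate that map-side combine with a rough sketch (hypothetical word-count dataset, `sc` as in spark-shell): `reduceByKey` pre-aggregates records on each mapper before they are written for the shuffle, so the shuffle write, and the downstream shuffle read, is typically much smaller than with `groupByKey`, which ships every record across the network.

```scala
// Compare the shuffle write/read of these two variants in the Spark UI.
val words = sc.textFile("hdfs:///data/articles")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))

// reduceByKey combines counts on the map side before the shuffle, so only
// one (word, partialCount) record per word per partition is shuffle-written.
val combined = words.reduceByKey(_ + _)

// groupByKey performs no map-side combine: every (word, 1) pair is
// shuffle-written and shuffle-read, then summed on the reduce side.
val grouped = words.groupByKey().mapValues(_.sum)

combined.count()  // trigger the jobs so their metrics show up in the UI
grouped.count()
```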
Upvotes: 2
Reputation: 4427
From the UI tooltips:
Shuffle Read
Total shuffle bytes and records read (includes both data read locally and data read from remote executors).
Shuffle Write
Bytes and records written to disk in order to be read by a shuffle in a future stage.
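If you want the same numbers outside the UI, one option (a sketch assuming Spark 2.x+, using the public `SparkListener`/`TaskMetrics` API) is to register a listener and log the per-task shuffle metrics:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs the same shuffle metrics the UI aggregates per stage.
// Register with: sc.addSparkListener(new ShuffleMetricsListener)
class ShuffleMetricsListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      val read  = m.shuffleReadMetrics
      val write = m.shuffleWriteMetrics
      println(
        s"stage ${taskEnd.stageId}: " +
        s"shuffle read ${read.totalBytesRead} B (${read.recordsRead} records), " +
        s"shuffle write ${write.bytesWritten} B (${write.recordsWritten} records)")
    }
  }
}
```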
Upvotes: 12