Reputation: 680
Lately I've been tuning the performance of some large, shuffle-heavy jobs. Looking at the Spark UI, I noticed a metric called "Shuffle Read Blocked Time" under the Additional Metrics section.
This "Shuffle Read Blocked Time" seems to account for upwards of 50% of the task duration for a large swath of tasks.
While I can intuit some possibilities for what this means, I can't find any documentation that explains what it actually represents. Needless to say, I also haven't been able to find any resources on mitigation strategies.
Can anyone provide some insight into how I might reduce Shuffle Read Blocked Time?
Upvotes: 8
Views: 4507
Reputation: 863
"Shuffle Read Blocked Time" is the time that tasks spent blocked waiting for shuffle data to be read from remote machines. The exact metric it feeds from is shuffleReadMetrics.fetchWaitTime.
It's hard to suggest a mitigation strategy without knowing what data you're reading or what sort of remote machines you're reading from. However, a few general approaches are worth considering:
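- Reduce how much data crosses the shuffle in the first place: filter and project early, prefer map-side-combining operations such as `reduceByKey` or `aggregateByKey` over `groupByKey`, and use broadcast joins when one side is small.
- Check for partition skew: a handful of oversized shuffle blocks will leave the tasks fetching them blocked far longer than their peers.
- Look at the health of the nodes serving the blocks: a slow disk or saturated network card on one executor shows up as fetch wait time on every task reading from it.
- Experiment with Spark's shuffle fetch settings, as sketched below.

On that last point, here's a sketch of the stock configuration properties that govern remote fetch behaviour. These are real Spark properties, but the values are illustrative starting points I've picked for the example, not recommendations:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  // Data buffered in flight per reduce task (default 48m). Larger values
  // can hide network latency at the cost of executor memory.
  .set("spark.reducer.maxSizeInFlight", "96m")
  // Compress shuffle output (default true): trades CPU for network I/O.
  .set("spark.shuffle.compress", "true")
  // Retry behaviour for fetches from flaky or overloaded nodes
  // (defaults: 3 retries, 5s wait).
  .set("spark.shuffle.io.maxRetries", "6")
  .set("spark.shuffle.io.retryWait", "10s")
  // More reduce partitions mean smaller, more evenly spread fetches
  // (default 200 for SQL/DataFrame shuffles).
  .set("spark.sql.shuffle.partitions", "400")

val spark = SparkSession.builder().config(conf).getOrCreate()
```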
As to the metrics, this documentation should shed some light on them: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-webui-StagePage.html
Lastly, I also found it hard to track down information on this metric, but searching Google for the exact phrase in quotes ("Shuffle Read Blocked Time") turns up some decent results.
Upvotes: 4