Wang Danyang

Reputation: 76

How to get detailed information about Spark tasks

Looking at the Spark UI timeline, I find that the last task of a specific stage in my Spark application always takes far too long. It seems the task will never finish; I have waited up to six times longer than a normal task takes.

I want to get more information about this last task, but I don't know how to debug it. Can anyone give me some suggestions?

Thanks for your help!

The data has been partitioned well, so the last task shouldn't have too much data.
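For context, a common cause of a single straggler task is key skew: even when the number of partitions looks healthy, a hot key can concentrate most records into one partition under hash partitioning. A toy illustration in plain Python (no Spark required; the key distribution and partition count are made up for the example):

```python
from collections import Counter

NUM_PARTITIONS = 8

# 10,000 records: 70% share a single hot key, the rest are spread out.
keys = ["hot_key"] * 7000 + [f"key_{i}" for i in range(3000)]

# Spark's HashPartitioner is essentially hash(key) % numPartitions,
# so every record with the hot key lands in the same partition.
sizes = Counter(hash(k) % NUM_PARTITIONS for k in keys)

largest = max(sizes.values())
print(dict(sizes))
print(f"largest partition holds {largest / len(keys):.0%} of all records")
```

The task that reads the oversized partition does most of the work alone, which matches the "one task runs many times longer" symptom. Checking per-partition record counts in the Spark UI (or via the stage's task metrics) is one way to confirm or rule this out.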

Upvotes: 2

Views: 1441

Answers (1)

  1. Check the explain plan of the resulting dataframe to understand what operations are happening. Are there any shuffles? Sometimes when operations are performed on a dataframe (such as joins), intermediate dataframes can end up mapped to a smaller number of partitions, and this can cause slower performance because the data isn't as well distributed as it could be.

  2. Check whether there are many shuffles and repeated uses of the same dataframe, and try caching the dataframe that comes right after a shuffle.

  3. Check the Spark UI (the driver's address on port 4040 by default) to see the data volume of cached dataframes, which processes are running, and whether the time goes to other overheads such as GC or is pure processing time.

Hope that helps.

Upvotes: 2
