Reputation: 4694
I have a Spark Streaming application that reads data from Kafka over the network. It is important to note that the cluster and the Kafka servers are in different geographies.
The average time to complete a job is around 8-10 minutes (I am running 10-minute intervals). However, in certain batches the job completion time shoots up by a random amount (it could be 20 minutes, 50 minutes, or an hour). Upon digging I found that all tasks complete on time except one, and that single task drags out the whole turnaround time. For example, here is the task time log from one such instance:
In this case task 6 has taken 54 minutes while the others have finished very quickly, even though the input split is the same. I have attributed this to network issues (slow/clogged connection) and am of the opinion that restarting that task could have saved a lot of time.
Does Spark allow some control through which we can restart slow tasks on a different node and then use the result of whichever copy completes first? Or is there a better solution to this problem that I am unaware of?
Upvotes: 2
Views: 1859
Reputation: 453
I would definitely have a look at the spark.speculation.* configuration parameters and set them to be a lot more aggressive. In your case, I think the following values would be appropriate:
spark.speculation = true
spark.speculation.interval = 1min (How often Spark will check for tasks to speculate.)
spark.speculation.multiplier = 1.1 (How many times slower a task is than the median to be considered for speculation.)
spark.speculation.quantile = 0.5 (Percentage of tasks which must be complete before speculation is enabled for a particular stage.)
You can find the full list of configuration parameters here.
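For reference, here is a minimal sketch (assuming a Scala Spark Streaming app; the application name and the stream setup are placeholders, not taken from the question) of setting these properties programmatically on the SparkConf before creating the StreamingContext:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    // Enable speculative execution and tune it to relaunch straggler tasks sooner.
    val conf = new SparkConf()
      .setAppName("kafka-streaming-app")           // placeholder name
      .set("spark.speculation", "true")
      .set("spark.speculation.interval", "1min")   // how often to check for stragglers
      .set("spark.speculation.multiplier", "1.1")  // a task 1.1x slower than the median is a candidate
      .set("spark.speculation.quantile", "0.5")    // wait until 50% of the stage's tasks are done

    // 10-minute batch interval, matching the question.
    val ssc = new StreamingContext(conf, Minutes(10))

    // ... create the Kafka stream and processing on ssc as before ...

The same values can also be supplied at submit time with --conf flags (e.g. --conf spark.speculation=true), so the application code does not need to change.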
Upvotes: 1