Sorin

Reputation: 910

Multiple stages are needed for all the tasks to finish

I have a spark job that looks like this:

(rdd
    .keyBy(lambda x: (x.id, x.location))
    .aggregateByKey('my 3 aggregation parameters')
    .map(expensiveMapFunction)
    .collect())

The map step is very expensive, and I expected all the tasks running it to execute in parallel, since the number of partitions is large enough (equal to the number of keys). However, the job is split into several stages (usually 2 or 3), and on each stage only a few tasks do actual computation while the rest have nothing to do. If all the tasks ran at once, the job would finish in a single stage, but it now takes three times longer because the tasks seem to run in 3 batches.

What could cause this behavior?

Upvotes: 1

Views: 789

Answers (1)

zero323

Reputation: 330353

I think you have a wrong impression about the meaning of a stage.

The job corresponding to the code snippet you've shown requires at least two stages (or three if you count the result stage). Each stage in Spark is a set of local operations that produces output for a shuffle.

Assuming that the rdd you use as input doesn't itself require shuffling, you need:

  • one stage to compute rdd and the mapSideCombine part of aggregateByKey with seqFunc
  • one stage to compute the merge part of aggregateByKey with combFunc and the subsequent map with expensiveMapFunction (see the sketch below)
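For concreteness, here is a minimal sketch of such a job with the three aggregateByKey parameters spelled out. The Record type and the list-building aggregation are hypothetical stand-ins, since your actual logic isn't shown:

    from collections import namedtuple
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    Record = namedtuple("Record", ["id", "location", "value"])

    rdd = sc.parallelize([
        Record(1, "a", 10), Record(1, "a", 20), Record(2, "b", 30),
    ])

    result = (rdd
        .keyBy(lambda x: (x.id, x.location))
        # The three parameters: zero value, seqFunc (map-side, first stage),
        # combFunc (after the shuffle, second stage).
        .aggregateByKey([],
                        lambda acc, x: acc + [x.value],  # seqFunc
                        lambda a, b: a + b)              # combFunc
        .map(lambda kv: (kv[0], sum(kv[1])))  # stand-in for expensiveMapFunction
        .collect())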

The number of stages is completely determined by the corresponding DAG and cannot change without changing the lineage.
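If you want to see where the stage boundaries fall, the lineage itself shows them; a sketch, reusing the hypothetical job from above (in the toDebugString output, each extra indentation level marks a shuffle dependency, i.e. a stage boundary):

    aggregated = (rdd
        .keyBy(lambda x: (x.id, x.location))
        .aggregateByKey([], lambda acc, x: acc + [x.value], lambda a, b: a + b))

    # The indented block in the printed lineage is everything computed
    # before the shuffle; the outer level is the post-shuffle stage.
    print(aggregated.toDebugString())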

Edit (based on additional information from the comments):

If you're actually concerned about the number of active tasks after aggregateByKey, this is typically a symptom of significant data skew. If the number of frequent keys is low, you can expect most of the data to be assigned to only a few partitions during the shuffle.
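One way to confirm skew is to count how many records each shuffle partition actually received; a quick sketch against the aggregated RDD from above:

    # One count per partition; a few large numbers next to many zeros
    # means the shuffle concentrated the data on a handful of tasks.
    partition_sizes = aggregated.glom().map(len).collect()
    print(partition_sizes)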

Unfortunately, there is no universal solution in cases like this. Depending on the aggregation logic and expensiveMapFunction, you may try some salting to obtain a better data distribution.
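As an illustration only, salting could look roughly like this, assuming the aggregation is associative so that partial results can be merged in a second round (the zero value, seq_func and comb_func are the hypothetical ones from the sketch above):

    import random

    N_SALTS = 10  # spread each hot key over this many buckets

    seq_func = lambda acc, x: acc + [x.value]   # hypothetical seqFunc
    comb_func = lambda a, b: a + b              # hypothetical combFunc

    desalted = (rdd
        # Add a random salt component so one hot key maps to N_SALTS keys.
        .keyBy(lambda x: (x.id, x.location, random.randrange(N_SALTS)))
        .aggregateByKey([], seq_func, comb_func)  # first, salted round
        .map(lambda kv: (kv[0][:2], kv[1]))       # drop the salt from the key
        .reduceByKey(comb_func))                  # merge the partial results

The first round spreads each hot key over N_SALTS partitions; the second round merges the partial aggregates, which should be much smaller than the raw data.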

Upvotes: 2
