Sorin

Reputation: 910

Multiple stages are needed for all the tasks to finish

I have a spark job that looks like this:

(rdd
    .keyBy(lambda x: (x.id, x.location))
    .aggregateByKey('my 3 aggregation parameters')
    .map(expensiveMapFunction)
    .collect())

The map step is very expensive, and I expected all the tasks running it to execute in parallel, since the number of partitions is large enough (equal to the number of keys). However, the job is split into several stages (usually 2 or 3), and on each stage only a few tasks do actual computation while the rest have nothing to do. If all the tasks ran at once, the job would finish in a single stage, but it now takes three times longer because the tasks seem to run in 3 batches.

What could cause this behavior?

Upvotes: 1

Views: 789

Answers (1)

zero323

Reputation: 330353

I think you have a wrong impression about the meaning of a stage.

The job corresponding to the code snippet you've shown requires at least two stages (or three if you count the result stage). Each stage in Spark is a set of local operations that produces output for a shuffle.

Assuming that the rdd you use as input doesn't itself require shuffling, you need:

  • one stage to compute rdd and the mapSideCombine part of aggregateByKey with seqFunc
  • one stage to compute the merge part of aggregateByKey with combFunc and the subsequent map with expensiveMapFunction (see the sketch below)
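For concreteness, here is a minimal sketch of such a job with the three aggregateByKey parameters spelled out. The Record type and the list-building aggregation are hypothetical stand-ins, since your actual logic isn't shown:

    from collections import namedtuple
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    Record = namedtuple("Record", ["id", "location", "value"])

    rdd = sc.parallelize([
        Record(1, "a", 10), Record(1, "a", 20), Record(2, "b", 30),
    ])

    result = (rdd
        .keyBy(lambda x: (x.id, x.location))
        # The three parameters: zero value, seqFunc (map-side, first stage),
        # combFunc (after the shuffle, second stage).
        .aggregateByKey([],
                        lambda acc, x: acc + [x.value],  # seqFunc
                        lambda a, b: a + b)              # combFunc
        .map(lambda kv: (kv[0], sum(kv[1])))  # stand-in for expensiveMapFunction
        .collect())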

The number of stages is completely determined by the corresponding DAG and cannot change without changing the lineage.
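If you want to see where the stage boundaries fall, the lineage itself shows them; a sketch, reusing the hypothetical job from above (in the toDebugString output, each extra indentation level marks a shuffle dependency, i.e. a stage boundary):

    aggregated = (rdd
        .keyBy(lambda x: (x.id, x.location))
        .aggregateByKey([], lambda acc, x: acc + [x.value], lambda a, b: a + b))

    # The indented block in the printed lineage is everything computed
    # before the shuffle; the outer level is the post-shuffle stage.
    print(aggregated.toDebugString())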

Edit (based on additional information from the comments):

If you're actually concerned about the number of active tasks after aggregateByKey, this is typically a symptom of significant data skew. If the number of frequent keys is low, you can expect most of the data to be assigned to only a few partitions during the shuffle.
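One way to confirm skew is to count how many records each shuffle partition actually received; a quick sketch against the aggregated RDD from above:

    # One count per partition; a few large numbers next to many zeros
    # means the shuffle concentrated the data on a handful of tasks.
    partition_sizes = aggregated.glom().map(len).collect()
    print(partition_sizes)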

Unfortunately, there is no universal solution in cases like this. Depending on the aggregation logic and expensiveMapFunction, you may try some salting to obtain a better data distribution.
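As an illustration only, salting could look roughly like this, assuming the aggregation is associative so that partial results can be merged in a second round (the zero value, seq_func and comb_func are the hypothetical ones from the sketch above):

    import random

    N_SALTS = 10  # spread each hot key over this many buckets

    seq_func = lambda acc, x: acc + [x.value]   # hypothetical seqFunc
    comb_func = lambda a, b: a + b              # hypothetical combFunc

    desalted = (rdd
        # Add a random salt component so one hot key maps to N_SALTS keys.
        .keyBy(lambda x: (x.id, x.location, random.randrange(N_SALTS)))
        .aggregateByKey([], seq_func, comb_func)  # first, salted round
        .map(lambda kv: (kv[0][:2], kv[1]))       # drop the salt from the key
        .reduceByKey(comb_func))                  # merge the partial results

The first round spreads each hot key over N_SALTS partitions; the second round merges the partial aggregates, which should be much smaller than the raw data.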

Upvotes: 2
