abc_spark

Reputation: 383

Why does an action take multiple jobs to complete in Spark (Scala)?

I am doing a pivot operation on top of a DataFrame in spark-scala, but the single pivot takes multiple jobs to complete (as per the pic below).

What could be the possible reason?

[screenshot of the Spark UI showing multiple jobs]

This is rather a generic question, as I experience the same with other actions as well.

Upvotes: 1

Views: 654

Answers (1)

Ged

Reputation: 18108

Specifically, pivot causes Spark to launch extra jobs behind the scenes to determine the distinct values to be pivoted. You can supply the values explicitly to avoid this, afaik, but that is not generally done.
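A minimal sketch of the two variants (column names and data are made up for illustration) — `pivot` with an explicit value list skips the extra job Spark otherwise runs to collect the distinct pivot values:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("pivot-example").getOrCreate()
import spark.implicits._

val df = Seq(("2023", "Q1", 100), ("2023", "Q2", 150)).toDF("year", "quarter", "amount")

// Without a value list, Spark first runs a job to collect the distinct quarters.
val implicitPivot = df.groupBy("year").pivot("quarter").agg(sum("amount"))

// With an explicit value list, that extra job is skipped.
val explicitPivot = df.groupBy("year")
  .pivot("quarter", Seq("Q1", "Q2", "Q3", "Q4"))
  .agg(sum("amount"))
```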

show, take, and reading S3 paths also cause Spark to generate extra Jobs, as does schema inference.
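For the schema-inference case, a sketch (hypothetical path and columns) — supplying a schema up front avoids the extra pass over the data that inference triggers:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val spark = SparkSession.builder().appName("schema-example").getOrCreate()

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// inferSchema triggers an extra job to scan the file and guess the types.
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3://bucket/people.csv")

// An explicit schema skips that extra scan.
val explicit = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("s3://bucket/people.csv")
```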

Most of the blurb on Actions and Jobs is based on RDDs. With DataFrames and Catalyst, optimizations mean extra work can be initiated so as to improve performance.

Also, the display in the Spark UI is hard for many to follow. Often the name of the Job remains the same, but it reflects the work done for wide transformations involving shuffling, broken into what are known as Stages. groupBy, orderBy, and agg all do their thing based on "shuffle boundaries". It's the way it works, and your code contains those operations.

This Spark: disk I/O on stage boundaries explanation may give some insight into what is going on in the background. For example, the output of a groupBy is the input to an orderBy, spanning two Stages.
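The groupBy-into-orderBy case can be sketched like this (made-up data); each wide transformation introduces a shuffle boundary, so one action shows up in the UI as multiple Stages:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("stages-example").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

df.groupBy("key")          // shuffle 1: repartition rows by key
  .agg(sum("value").as("total"))
  .orderBy($"total".desc)  // shuffle 2: range-partition for the global sort
  .show()                  // a single action, but multiple Stages in the UI
```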

Upvotes: 1
