Reputation: 383
I am doing a pivot operation on top of a DataFrame in Spark/Scala, but a single pivot takes multiple jobs to complete (as per the pic below).
What could be the possible reason?
This is rather a generic question, as I experience the same with other actions as well.
Upvotes: 1
Views: 654
Reputation: 18108
Specifically, pivot causes Spark to launch extra jobs behind the scenes to determine the distinct values to be pivoted. You can supply those values yourself to avoid this, afaik, but that is not generally done. show, take and reading S3 paths also cause Spark to generate extra Jobs, as does schema inference by Spark.
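As a minimal sketch of that first point (the column names and sample data are made up for illustration, not taken from the question), supplying the pivot values up front lets Spark skip the extra job that collects the distinct values:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object PivotSketch extends App {
  val spark = SparkSession.builder()
    .appName("pivot-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical data, for illustration only
  val df = Seq(
    ("2021", "Q1", 100),
    ("2021", "Q2", 200),
    ("2022", "Q1", 150)
  ).toDF("year", "quarter", "amount")

  // Without explicit values: Spark first runs a separate job to collect
  // the distinct values of "quarter" before it can build the pivoted plan.
  val implicitPivot = df.groupBy("year").pivot("quarter").agg(sum("amount"))

  // With explicit values: the extra job to discover the pivot columns is avoided.
  val explicitPivot = df.groupBy("year")
    .pivot("quarter", Seq("Q1", "Q2", "Q3", "Q4"))
    .agg(sum("amount"))

  explicitPivot.show() // show() itself can also trigger an extra small Job
}
```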
Most of the blurb on Actions and Jobs stems from the RDD API. With DataFrames and the Catalyst optimizer, extra work can be initiated behind the scenes so as to improve performance.
Also, the display in the Spark UI is hard for many to follow. Often the name of the Job remains the same, but it concerns the work done for wide transformations involving shuffling, which is broken up into Stages. groupBy, orderBy and agg all do their thing based on "shuffle boundaries". That is simply the way it works, and your code contains those operations.
This Spark: disk I/O on stage boundaries explanation may give some insight as well into what is going on in the background. The output of a groupBy is the input to an orderBy, spread over two Stages.
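As a rough illustration of that last point (the DataFrame and its columns are assumed, reusing the sketch above rather than the question's actual code), a groupBy feeding an orderBy introduces two shuffle boundaries and therefore shows up as two Stages in the UI:

```scala
import org.apache.spark.sql.functions._

// Assuming df has columns "year" and "amount", as in the earlier sketch.
// Each wide transformation introduces a shuffle boundary, so the
// aggregation and the sort below appear as separate Stages.
val grouped = df.groupBy("year").agg(sum("amount").as("total")) // shuffle 1
val ordered = grouped.orderBy(desc("total"))                    // shuffle 2

ordered.show() // the action that actually triggers the Job(s)
```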
Upvotes: 1