Reputation: 91
I am looking to understand how a Spark DataFrame is executed when it depends on other DataFrames.
E.g. let's say I have three DataFrames: DF1, DF2 and DF3.
DF1 - Reads from table A.
DF2 - Reads from DF1.
DF3 - Joins DF1 and DF2.
When I do DF3.show(), would it also execute DF1 and DF2 in the background?
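In code, the setup looks roughly like this (the filter column and join key are placeholders, not part of the original question):
val df1 = spark.table("A")                     // DF1: reads from table A
val df2 = df1.filter(df1("country") === "NL")  // DF2: derived from DF1 (placeholder filter)
val df3 = df1.join(df2, Seq("id"))             // DF3: joins DF1 and DF2 (placeholder key)
df3.show()                                     // the action in question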
Upvotes: 0
Views: 554
Reputation: 18003
Small question with many aspects.
Spark uses lazy evaluation.
The show() triggers an action, and a ResultStage is created for the job.
The plan built for that action fuses narrow transformations (maps, filters) together / optimizes them where possible within a stage, so you may not really get a DF2 as a separate physical step, but you may. It all depends on whether shuffles across stages are needed, which in general result from wide transformations, and on special things Spark initiates in the background, e.g. for a pivot.
In your case it is a lengthy discussion, so here is a good link: https://mapr.com/blog/how-spark-runs-your-applications/
I leave caching out, but try the code below and look at the SQL tab as well as the Stages tab in the Spark UI. Also, try it without the join and see what happens.
// read the source data; caching is optional and commented out here
val df1 = spark.read.parquet("simple.parquet")//.cache()
// narrow transformation: filter df1
val df2 = df1.filter(df1("country") === "Holland")
// join df1 with df2 on "id" (usually a shuffle, unless Spark chooses a broadcast join)
val df3 = df1.join(df2, Seq("id"))
// action: triggers execution of the whole plan
df3.show(false)
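To see how the plan is actually laid out before running it, one option (not part of the original answer, just a suggestion) is to print the plans with explain; explain(true) shows the parsed, analyzed, optimized and physical plans:
// prints the logical and physical plans without executing the query
df3.explain(true)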
Upvotes: 0
Reputation: 56
Spark uses lazy evaluation: nothing is executed until an action is called.
When we submit a Spark application, Spark creates a Directed Acyclic Graph (DAG). It contains the lineage graph, which lets Spark recompute a DataFrame's partitions in case of an executor failure.
As per the DAG, the read is the first operation, then the transformations, and finally the action (show) produces the result.
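As a rough illustration (the table and column names here are made up), that lineage can be inspected without triggering any execution:
// building the plan is lazy; no job runs yet
val df1 = spark.table("A")
val df2 = df1.filter(df1("id") > 100)
// prints the RDD lineage the DAG is built from; still no action has run
println(df2.rdd.toDebugString)
// only an action such as show() actually executes the graph
df2.show()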
Upvotes: 0