Reputation: 588
I want to know how long a DataFrame or RDD is kept alive and when it dies/is removed. Is it different for a DataFrame and an RDD?
When a transformation is applied to a DataFrame/RDD, a new DataFrame/RDD is created. In that case, will 10 transformations create 10 DataFrames/RDDs, and will they stay alive until the end of the application or until the final DataFrame/RDD is written to disk? Please see the sample code below.
val transformDF1 = readDF.withColumn("new_column", sometransformation)
val transformDF2 = transformDF1.groupBy("col1","col2").agg(sum("col3"))
transformDF2.write.format("text").save(path)
What about the case where we chain the transformations together before assigning to a variable, like below?
val someDF = df
  .where(col("some_col") === "some_val")
  .withColumn("some-page", col("other_page") + 1)
  .drop("other_page")
  .select(col("col1"), col("col2"))
val someDF1 = someDF.join(someotherDF, joincond, "inner").select("somecols")
val finalDF = someDF1.distinct()
finalDF.write.save(path)
In the above code, is an intermediate DataFrame created for each chained transformation, and how long does each one stay alive?
Upvotes: 0
Views: 155
Reputation: 1642
Spark performs all executions lazily, which means an RDD will not be in memory until an action is called. For each wide dependency, Spark stores the intermediate (shuffle) data, not the RDD itself; please note that only the intermediate data is stored, not the RDD (unless it is cached).
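To illustrate, here is a minimal sketch, assuming a SparkSession named spark; the paths and column names are placeholders, not from your code:

import org.apache.spark.sql.functions._

val readDF = spark.read.parquet("/tmp/input")        // nothing is read yet, only a plan
val transformDF1 = readDF.withColumn("flag", lit(1)) // narrow transformation: just extends the plan
val transformDF2 = transformDF1
  .groupBy("col1", "col2")
  .agg(sum("col3"))                                  // wide transformation: a shuffle boundary in the plan

transformDF2.explain()                               // still no data in memory, only the plan

// The action below triggers execution; the executors write shuffle files for the
// groupBy, but the DataFrame itself is not kept in memory afterwards.
transformDF2.write.format("parquet").save("/tmp/out")

// Only an explicit cache()/persist() keeps the computed data around for reuse.
val cached = transformDF2.cache()
cached.count()      // materializes and stores the cached data
cached.unpersist()  // releases it when no longer needed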
Upvotes: 0
Reputation: 1339
The main point of Spark RDDs is that all executions are lazy. This means there is no data in memory until an action is called. The same applies to a DataFrame, because a DataFrame is actually a wrapper over an RDD.
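A small sketch of that, assuming a SparkSession named spark (the column names are hypothetical):

import org.apache.spark.sql.functions._

val df = spark.range(100).toDF("other_page")

// Each step returns a new DataFrame object, but each one is only a query plan.
val someDF = df
  .withColumn("some_page", col("other_page") + 1)
  .drop("other_page")

someDF.explain()    // prints the plan; nothing has been computed yet

// A DataFrame wraps an RDD under the hood; accessing it still computes nothing.
val underlyingRDD = someDF.rdd

// Only an action materializes the data, and it is discarded after the job unless cached.
someDF.count()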
Upvotes: 0