Bishamon Ten

Reputation: 588

SPARK: When is a DataFrame or RDD removed, i.e. how long does it stay alive, with no caching involved? Given: an action is called on some subsequent RDD/DF

I want to know how long a DataFrame or RDD is kept alive, i.e. when it dies/is removed. Is this different for DataFrames and RDDs?

  1. Are all parent DataFrames kept alive in memory until the last DataFrame/RDD is written to disk or displayed on screen?
  2. When a transformation is applied to a DataFrame/RDD, a new DataFrame/RDD is created. In that case, will 10 transformations create 10 DataFrames/RDDs, and will they stay alive until the end of the application, or until the final DataFrame/RDD is written to disk? See the sample code below.

    val transformDF1 = readDF.withColumn("new_column", sometransformation)
    val transformDF2 = transformDF1.groupBy("col1", "col2").agg(sum("col3"))
    transformDF2.write.format("text").save(path)
    
  3. What about the case where we chain the transformations together before assigning to a variable, like below?

    val someDF = df
      .where(col("some_col") === "some_val")
      .withColumn("some-page", col("other_page") + 1)
      .drop("other_page")
      .select(col("col1"), col("col2"))
    val someDF1 = someDF.join(someotherDF, joincond, "inner").select("somecols")
    val finalDF = someDF1.distinct()
    finalDF.write.save(path)

In the above code

  1. We have someDF created from a chain of transformations on the df DataFrame. Each transformation in the chain creates a DataFrame. Does each DataFrame created by a transformation in the chain remain alive in memory until finalDF is written to a file, or does only the DataFrame from the last transformation in the chain, the one assigned to the variable someDF, remain in memory? If the latter, until when is someDF retained; if the former, until when are all of them retained in memory?
  2. What about the other DataFrame, someDF1? What is its lifetime?
  3. If the chained transformations are not retained once control moves to the next transformation in the chain, is it better to chain as many transformations as possible, to keep more memory available? And will GC become a bottleneck if we chain transformations heavily?

Upvotes: 0

Views: 155

Answers (2)

Strick

Reputation: 1642

Spark executes everything lazily, which means an RDD will not be in memory until an action is called. For each wide dependency, Spark stores intermediate (shuffle) data, not the RDD itself; please note that only the intermediate data is stored, not the RDD (unless it is cached).
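
Here is a minimal, self-contained sketch of that behaviour (a local SparkSession and a small in-memory dataset stand in for the question's readDF; the app name and data are illustrative). The transformations only build a query plan, explain() prints that plan without running anything, and only the action at the end triggers execution, with the groupBy writing shuffle files as the stored intermediate data:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object LazyDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("lazy-demo")       // illustrative app name
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Stand-in for the question's readDF (hypothetical in-memory data).
        val readDF = Seq(("a", "x", 1L), ("a", "y", 2L), ("b", "x", 3L))
          .toDF("col1", "col2", "col3")

        // Neither line touches any data: each call only extends the query plan.
        val transformDF1 = readDF.withColumn("new_column", lit(1))               // narrow dependency
        val transformDF2 = transformDF1.groupBy("col1", "col2").agg(sum("col3")) // wide dependency

        // Still no execution: explain() just prints the plan built so far.
        transformDF2.explain()

        // The action finally runs the job. The groupBy's shuffle writes
        // intermediate files to executor-local disk; it is those files, not
        // the DataFrame objects above, that persist between stages.
        transformDF2.show()

        spark.stop()
      }
    }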

Upvotes: 0

Serge Harnyk

Reputation: 1339

The main point of Spark RDDs is that all execution is lazy. This means there is no data in memory until an action is called. The same applies to DataFrames, because a DataFrame is actually a wrapper over an RDD.
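
To make the wrapper relationship and the laziness concrete, here is a small sketch (the object name, app name, and data are illustrative): .rdd exposes the RDD underneath a DataFrame, both remain lazy until an action runs, and without caching each action recomputes the lineage from scratch:

    import org.apache.spark.sql.SparkSession

    object DfRddWrapperDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("df-rdd-wrapper")  // illustrative app name
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Lazy: this only defines a plan, no data is materialized yet.
        val df = Seq(1, 2, 3, 4).toDF("n").where($"n" > 1)

        // .rdd exposes the RDD underneath the DataFrame. Still lazy: it
        // builds an RDD lineage, it does not compute anything.
        val rdd = df.rdd

        // The action materializes data. Without caching, the partitions
        // computed for this action are discarded once the job finishes.
        println(rdd.count()) // first job

        // A second action recomputes the whole lineage from scratch.
        println(rdd.count()) // second, independent job

        spark.stop()
      }
    }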

Upvotes: 0
