How to use RDD checkpointing to share datasets across Spark applications?

Question

I have a Spark application that checkpoints an RDD. A simplified snippet, just to illustrate my question, is as follows:

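@Test
def testCheckpoint1(): Unit = {
  val data = List("Hello", "World", "Hello", "One", "Two")
  val rdd = sc.parallelize(data) // sc is initialized in the test setup
  sc.setCheckpointDir(Utils.getOutputDir())
  rdd.checkpoint() // marks the RDD for checkpointing; must precede the first action
  rdd.collect()    // the action runs the job, after which the checkpoint is written
}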

Once the RDD has been checkpointed to the file system, I would like to write a second Spark application that picks up the checkpointed data and uses it as its starting RDD.

ReliableCheckpointRDD is exactly the RDD that does this work, but it is private to Spark.

Since ReliableCheckpointRDD is private, it looks like Spark does not recommend using it outside Spark.

Is there a way to do this?

Jacek Laskowski · Accepted Answer

Quoting the scaladoc of RDD.checkpoint (highlighting mine):

checkpoint(): Unit Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext#setCheckpointDir and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.

So, RDD.checkpoint cuts the RDD lineage and, once the first job on that RDD runs, saves the RDD's data to the checkpoint directory, so you have something already pre-computed in case your Spark application fails and stops.
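
The sequence the scaladoc describes (persist, mark for checkpointing, then run an action) can be sketched as follows. This is a minimal illustration, not code from the question: the object name CheckpointSketch, the local[2] master, and the temporary checkpoint directory are all assumptions made for the example.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, named for illustration only.
object CheckpointSketch {

  /** Checkpoints a small RDD and reports whether and where it was written. */
  def run(sc: SparkContext, checkpointDir: String): (Boolean, Option[String]) = {
    sc.setCheckpointDir(checkpointDir)
    val rdd = sc.parallelize(List("Hello", "World", "Hello", "One", "Two"))

    // Persisting first avoids recomputing the RDD when the checkpoint is written.
    rdd.persist(StorageLevel.MEMORY_ONLY)
    // checkpoint() only marks the RDD; nothing is written at this point.
    rdd.checkpoint()
    // The first action runs the job, after which the checkpoint files are saved.
    rdd.count()

    (rdd.isCheckpointed, rdd.getCheckpointFile)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode and a temp directory, purely for demonstration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-sketch")
    val sc = new SparkContext(conf)
    try {
      val dir = Files.createTempDirectory("checkpoints").toString
      println(run(sc, dir))
    } finally {
      sc.stop()
    }
  }
}
```

The directory reported by getCheckpointFile is where the data for this particular RDD ends up, which is a subdirectory of the one passed to setCheckpointDir.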

Note that RDD checkpointing is very similar to RDD caching, but cached data stays private to the Spark application that cached it.

Let's read about Spark Streaming's checkpointing (which in some ways extends the concept of RDD checkpointing, bringing it closer to your need to share the results of computations between Spark applications):

Data checkpointing Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.

So, yes, in a sense you could share the partial results of computations in the form of RDD checkpoints, but why would you even want to do that when you could save the partial results through the "official" interface, using JSON, Parquet, CSV, etc.?
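
As a rough sketch of that "official" route, one application could write the intermediate RDD as Parquet and another could read it back as its starting RDD. The object name SharedDataset, the column name "value", and the overwrite save mode are assumptions chosen for the example; it also assumes the spark-sql module is on the classpath.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper, named for illustration only.
object SharedDataset {

  /** Application 1: save the intermediate RDD to a location both applications can reach. */
  def write(spark: SparkSession, rdd: RDD[String], path: String): Unit = {
    import spark.implicits._
    // "overwrite" is an assumption; pick the save mode that suits your job.
    rdd.toDF("value").write.mode("overwrite").parquet(path)
  }

  /** Application 2: load the saved data and continue from it as an RDD. */
  def read(spark: SparkSession, path: String): RDD[String] = {
    import spark.implicits._
    spark.read.parquet(path).as[String].rdd
  }
}
```

Unlike a checkpoint, data written this way carries its schema and can be read by any tool that understands Parquet. Element order is not guaranteed to survive the round trip, so sort after reading if order matters.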

I doubt this internal persistence interface gives you more features or flexibility than the aforementioned formats. Yes, it is technically possible to use RDD checkpointing to share datasets between Spark applications, but it is too much effort for not much gain.
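
For completeness, here is a sketch of what "technically possible" could look like. It relies on SparkContext.checkpointFile, which is protected[spark], so the helper has to be compiled into the org.apache.spark package. This is an unsupported internal and may change between Spark versions; the object name CheckpointRecovery is hypothetical, and checkpointPath is assumed to be the directory the first application got from rdd.getCheckpointFile.

```scala
// Must live in this package: checkpointFile is protected[spark],
// i.e. an internal API with no compatibility guarantees.
package org.apache.spark

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD

// Hypothetical helper, named for illustration only.
object CheckpointRecovery {

  /**
   * Rebuilds an RDD from checkpoint files written by another application.
   * checkpointPath is assumed to be the per-RDD checkpoint directory
   * (the value of rdd.getCheckpointFile in the application that wrote it),
   * and T must match the element type that was checkpointed.
   */
  def recover[T: ClassTag](sc: SparkContext, checkpointPath: String): RDD[T] =
    sc.checkpointFile[T](checkpointPath)
}
```

The second application would then call CheckpointRecovery.recover[String](sc, path) with the path recorded by the first one. The need to sit inside Spark's own package is a fair illustration of why this is more effort than it is worth.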
