Reputation: 695
I have a long-running iteration in my program, and I want to cache and checkpoint every few iterations (this technique is suggested on the web to cut long lineage) so I won't get a StackOverflowError. I do it like this:
for (i <- 2 to 100) {
  // cache and checkpoint every 30 iterations
  if (i % 30 == 0) {
    graph.cache()
    graph.checkpoint()
    // I use numEdges (an action) in order to start the transformation I need
    graph.numEdges
  }
  // graphs are stored in a list;
  // here I use the graph of the previous iteration
  // and perform a transformation for this iteration
}
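To see why the every-30-iterations cadence bounds lineage depth, here is a minimal plain-Scala model of the loop above (no Spark required; the depth counter is a stand-in for the real RDD lineage, and "checkpoint" is modeled as resetting it to 1):

```scala
object CheckpointCadence {
  // Model lineage depth only: each iteration's transformation adds one
  // level to the lineage; a checkpoint truncates the lineage back to 1.
  def lineageDepths(iterations: Int, interval: Int): Seq[Int] = {
    var depth = 1
    (2 to iterations).map { i =>
      depth += 1                        // one transformation per iteration
      if (i % interval == 0) depth = 1  // checkpoint truncates the lineage
      depth
    }
  }

  def main(args: Array[String]): Unit = {
    val depths = lineageDepths(100, 30)
    // Without checkpointing the final depth would be ~100;
    // with it, the depth never exceeds the checkpoint interval.
    println(s"max lineage depth: ${depths.max}")
  }
}
```

With the interval at 30, the modeled lineage never grows past 30 levels across 100 iterations, which is what keeps the DAG small enough to avoid the StackOverflowError. Note that in real Spark the truncation only happens after an action materializes the checkpointed RDD, which is why the loop above calls numEdges.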
and I have set the checkpoint directory like this:
val sc = new SparkContext(conf)
sc.setCheckpointDir("checkpoints/")
However, when I finally run my program I get an exception:
Exception in thread "main" org.apache.spark.SparkException: Invalid checkpoint directory
I use 3 computers, each running Ubuntu 14.04, and on each I use the pre-built Spark 1.4.1 for Hadoop 2.4 or later.
Upvotes: 2
Views: 20368
Reputation: 104
If you have already set up HDFS on a cluster of nodes, you can find your HDFS address in core-site.xml, located in the directory HADOOP_HOME/etc/hadoop. For me, core-site.xml is set up as:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
Then you can create a directory on HDFS to save RDD checkpoint files; let's name this directory RddCheckPoint and create it with the hadoop fs shell:
$ hadoop fs -mkdir /RddCheckPoint
If you use pyspark, after the SparkContext is initialized by sc = SparkContext(conf), you can set the checkpoint directory with:
sc.setCheckpointDir("hdfs://master:9000/RddCheckPoint")
When an RDD is checkpointed, the checkpoint files are saved in the HDFS directory RddCheckPoint. To have a look:
$ hadoop fs -ls /RddCheckPoint
Upvotes: 4
Reputation: 7442
The checkpoint directory needs to be an HDFS-compatible directory (from the Scala doc: "HDFS-compatible directory where the checkpoint data will be reliably stored. Note that this must be a fault-tolerant file system like HDFS"). So if you have HDFS set up on those nodes, point it to "hdfs://[yourcheckpointdirectory]".
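Concretely, the question's setup only needs the checkpoint path changed from the relative local path "checkpoints/" to an HDFS URI. A minimal sketch (the namenode address hdfs://master:9000 and the directory /RddCheckPoint are assumptions here; substitute the fs.default.name value from your own core-site.xml):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: point the checkpoint directory at a fault-tolerant filesystem
// (HDFS) instead of a relative local path. "hdfs://master:9000" is an
// assumed namenode address; replace it with your cluster's own.
val conf = new SparkConf().setAppName("GraphCheckpointing")
val sc = new SparkContext(conf)
sc.setCheckpointDir("hdfs://master:9000/RddCheckPoint")
```

The directory must be reachable from every executor, which is why a local relative path fails on a 3-node cluster: each worker would resolve "checkpoints/" to a different local filesystem path.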
Upvotes: 2