Apache Spark: Streaming without HDFS checkpoint

Question

I'm implementing a Spark job that uses reduceByKeyAndWindow, so I need to enable checkpointing.

From Spark's website I see that:

Checkpointing can be enabled by setting a directory in a fault-tolerant, reliable file system (e.g., HDFS, S3, etc.) to which the checkpoint information will be saved.

My application is just for academic purposes, so I don't want to set up HDFS for checkpointing; a local directory is enough. This works fine on macOS (using a temporary directory as the checkpoint directory), but on Windows it throws an exception that appears to be related to permissions.

I already tried starting Eclipse as administrator, and creating the directory manually with setWritable, setReadable and setExecutable set to true. Any hints on how to overcome the problem on Windows?

Thanks!

Update: Here's my code and the exception. To clarify again, it works fine on Mac but not on Windows.

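// Local mode with two threads; batch interval of 1000 ms.
SparkConf conf = new SparkConf().setAppName("testApp").setMaster("local[2]");
JavaSparkContext ctx = new JavaSparkContext(conf);
JavaStreamingContext jsc = new JavaStreamingContext(ctx, new Duration(1000));
// Files here is presumably Guava's com.google.common.io.Files; checkpoints go to a local temp dir.
jsc.checkpoint(Files.createTempDir().getAbsolutePath());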

Exception:

Exception in thread "pool-7-thread-3" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:772)
at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:135)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

sergi123 · Accepted Answer

Solved by adding the latest Hadoop libraries to my project.

If you are using Maven, the following set of dependencies does the trick.

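    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.3.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.10</artifactId>
        <version>1.2.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-twitter_2.11</artifactId>
        <version>1.3.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>1.2.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.6.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.6.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.0</version>
    </dependency>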

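For completeness, here is a minimal sketch of preparing the local checkpoint directory with only the JDK, instead of the Guava call used in the question. The class name LocalCheckpointDir and its create method are hypothetical helpers, not part of Spark or Hadoop. It also assumes the checkpoint call accepts a file: URI string, as Hadoop paths generally do. This only verifies up front that the directory exists and is writable; it is not a substitute for the dependency change reported above as the fix.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of Spark or Hadoop.
final class LocalCheckpointDir {

    // Creates a fresh temporary directory and returns it as a file: URI string.
    // Fails early with a clear message if the JVM cannot write to it.
    static String create(String prefix) throws IOException {
        Path dir = Files.createTempDirectory(prefix);
        if (!Files.isWritable(dir)) {
            throw new IOException("Checkpoint directory is not writable: " + dir);
        }
        return dir.toUri().toString();
    }

    public static void main(String[] args) throws IOException {
        String checkpointDir = create("spark-checkpoint-");
        System.out.println(checkpointDir);
        // Assumed usage with the streaming context from the question:
        // jsc.checkpoint(checkpointDir);
    }
}
```

If this check passes and the exception still appears, the problem is unlikely to be the directory's own permissions, which is consistent with the fix above being a change of libraries rather than of the directory.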