DGALLENE

Reputation: 31

Is it possible to work with Spark Structured Streaming without HDFS?

I've been working with HDFS and Kafka for some time, and I've noticed that Kafka is more reliable than HDFS. So now, working with Spark Structured Streaming, I'm surprised that checkpointing only works with HDFS. Checkpointing with Kafka would be faster and more reliable. So is it possible to work with Spark Structured Streaming without HDFS? It seems strange that we have to use HDFS only for streaming data that lives in Kafka. Or is it possible to tell Spark to forget checkpointing and manage it in the program myself?

Spark 2.4.7

Thank you

Upvotes: 3

Views: 400

Answers (1)

Michael Heil

Reputation: 18515

You are not restricted to using an HDFS path as a checkpoint location.

According to the section Recovering from Failures with Checkpointing in the Spark Structured Streaming Guide, the path has to be "an HDFS compatible file system". Therefore, other file systems will also work. However, it is mandatory that all executors have access to that file system. For example, choosing the local file system on the edge node of your cluster might work in local mode, but in cluster mode it can cause issues.
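For illustration, a minimal sketch of setting the checkpoint location on a Kafka streaming query (the bootstrap server, topic name, and paths below are placeholders, not from the question; any HDFS-compatible URI reachable by all executors would work in place of the local path):

```python
# Hypothetical sketch: checkpointing to a non-HDFS location.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

# Read from Kafka (example server and topic).
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my-topic")
      .load())

query = (df.writeStream
         .format("console")
         # Any HDFS-compatible file system is accepted here: a local path
         # in local mode, or e.g. an s3a:// URI in cluster mode, as long as
         # every executor can access it.
         .option("checkpointLocation", "/tmp/spark-checkpoints/my-query")
         .start())
```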

Also, it is not possible to have Kafka itself manage the offset positions for Spark Structured Streaming. I have explained this in more depth in my answer to How to manually set group.id and commit kafka offsets in spark structured streaming?.

Upvotes: 1
