Reputation: 2628
I have a lot of Kafka topics with 1 partition each, which are being produced to and consumed from (REST API - Kafka - SQL Server). Now I want to take periodic dumps of this data into HDFS so I can run analytics on it later down the line.
Since all I really need is a dump, I'm not sure that I need Spark Streaming. However, all the documentation and examples I can find use Spark Streaming for this.
Is there a way to populate a DF/RDD from a Kafka topic without having a streaming job running? Or is the paradigm here to kill the "streaming" job once the set window of min-to-max offsets has been processed, thus treating the streaming job as a batch job?
Upvotes: 0
Views: 1818
Reputation: 32050
As you've correctly identified, you do not have to use Spark Streaming for this. One approach would be to use the HDFS connector for Kafka Connect. Kafka Connect is part of Apache Kafka; the HDFS sink connector reads a Kafka topic and writes its messages to HDFS. You can see the documentation for it here.
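For illustration, a minimal standalone connector config might look something like the sketch below. This assumes the Confluent HDFS sink connector plugin is installed; the connector name, topic name, and HDFS URL are placeholders you would replace with your own values.

    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    topics=my_topic
    hdfs.url=hdfs://namenode:8020
    flush.size=1000

You would run this with the Kafka Connect worker (standalone or distributed), and it continuously appends new messages from the topic to files in HDFS, so there is no Spark job to manage at all.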
Upvotes: 1
Reputation: 850
Kafka is a stream processing platform, so using it with Spark Streaming is easy.
You could use Spark Streaming and checkpoint the data at specified intervals, which fulfills your requirement.
For more on checkpointing, see: https://spark.apache.org/docs/2.0.2/streaming-programming-guide.html#checkpointing
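As a rough sketch of what that could look like (assuming the spark-streaming-kafka-0-10 integration; broker address, topic name, group id, and HDFS paths below are placeholders): the checkpoint directory keeps the job's offsets and state across restarts, while the actual dump to HDFS happens in foreachRDD.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object KafkaToHdfsStreaming {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("kafka-to-hdfs")
        val ssc = new StreamingContext(conf, Seconds(60))

        // Checkpoint directory on HDFS (placeholder path) so offsets and
        // streaming state survive a driver restart.
        ssc.checkpoint("hdfs://namenode:8020/checkpoints/kafka-to-hdfs")

        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "broker:9092",                 // placeholder
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "hdfs-dump",
          "auto.offset.reset" -> "earliest",
          "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("my_topic"), kafkaParams))

        // Dump each micro-batch of message values to HDFS as plain text files.
        stream.map(_.value).foreachRDD { (rdd, time) =>
          if (!rdd.isEmpty())
            rdd.saveAsTextFile(s"hdfs://namenode:8020/dumps/my_topic/${time.milliseconds}")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }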
Upvotes: 0
Reputation: 85
You can use the createRDD method of KafkaUtils to run a Spark batch job.
A similar question has been answered here: Read Kafka topic in a Spark batch job
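A minimal batch sketch, again assuming the spark-streaming-kafka-0-10 artifact; the broker address, topic, offset range, and HDFS output path are placeholders you would fill in from your own offset bookkeeping (e.g. the last offset you dumped and the current end offset of the partition).

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}
    import scala.collection.JavaConverters._

    object KafkaToHdfsBatch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-dump"))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "broker:9092",                 // placeholder
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "hdfs-batch-dump"
        ).asJava

        // Explicit offset range for the topic's single partition (0):
        // topic, partition, fromOffset (inclusive), untilOffset (exclusive).
        val offsetRanges = Array(OffsetRange("my_topic", 0, 0L, 100000L))

        val rdd = KafkaUtils.createRDD[String, String](
          sc, kafkaParams, offsetRanges, LocationStrategies.PreferConsistent)

        // Write the message values to HDFS and exit - no long-running streaming job.
        rdd.map(_.value).saveAsTextFile("hdfs://namenode:8020/dumps/my_topic/batch-0")
        sc.stop()
      }
    }

Because the offset range is fixed up front, the job reads exactly that slice of the topic and then terminates, which matches the "treat it as a batch job" idea in the question.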
Upvotes: 1