Havnar

Reputation: 2628

Consume Kafka data to HDFS with Spark batch

I have a lot of Kafka topics with 1 partition each, being produced to and consumed from (REST API - Kafka - SQL Server). But now I want to take periodic dumps of this data to keep in HDFS, to perform analytics on later down the line.

Since this is basically just a dump that I need, I'm not sure I need Spark Streaming. However, all the documentation and examples use Spark Streaming for this.

Is there a way to populate a DataFrame/RDD from a Kafka topic without having a streaming job running? Or is the paradigm here to kill the "streaming" job once the set window of min-to-max offsets has been processed, thus treating the streaming job as a batch job?

Upvotes: 0

Views: 1818

Answers (3)

Robin Moffatt

Reputation: 32050

As you've correctly identified, you do not have to use Spark Streaming for this. One approach would be to use the HDFS connector for Kafka Connect. Kafka Connect is part of Apache Kafka. It takes a Kafka topic and writes messages from it to HDFS. You can see the documentation for it here.
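As a rough sketch, a standalone sink configuration for the Confluent-provided HDFS connector might look something like the following. The broker, topic, and HDFS URL are placeholders, and the full set of required properties should be checked against the connector documentation:

```
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=my-topic
hdfs.url=hdfs://namenode:8020
flush.size=1000
```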

Upvotes: 1

Deepan Ram

Reputation: 850

Kafka is a stream processing platform, so using it with Spark Streaming is straightforward.

You could use Spark Streaming and checkpoint the data at specified intervals, which fulfills your requirement.

For more on checkpointing, see https://spark.apache.org/docs/2.0.2/streaming-programming-guide.html#checkpointing
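To illustrate, here is a minimal sketch of a direct stream from Kafka with checkpointing enabled, writing each micro-batch to HDFS. It assumes the spark-streaming-kafka-0-10 artifact, and the broker, topic, group id, and paths are placeholders, not anything from the question:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaToHdfsStreaming {
  // Hypothetical checkpoint location; replace with your own.
  val checkpointDir = "hdfs:///checkpoints/kafka-dump"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("kafka-to-hdfs")
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.checkpoint(checkpointDir) // enable checkpointing for recovery

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "hdfs-dump",
      "auto.offset.reset" -> "earliest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

    // Write the values of each micro-batch out to HDFS as text files.
    stream.map(_.value).foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty) rdd.saveAsTextFile(s"hdfs:///data/my-topic/batch-${time.milliseconds}")
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if one exists, otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```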

Upvotes: 0

manohar amrutkar

Reputation: 85

You can use the createRDD method of KafkaUtils to build a Spark batch job.

A similar question has been answered here: Read Kafka topic in a Spark batch job
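For example, a minimal sketch of a batch dump using createRDD (spark-streaming-kafka-0-10) might look like this. The broker, topic, offsets, and output path are placeholders you would need to fill in, e.g. by looking up the min/max offsets for the window you want:

```scala
import scala.collection.JavaConverters._

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}

object KafkaBatchDump {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-dump"))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "hdfs-dump"
    )

    // One range per topic-partition: topic, partition, fromOffset (inclusive), untilOffset (exclusive).
    // With one partition per topic this is a single entry; the offsets here are placeholders.
    val offsetRanges = Array(OffsetRange("my-topic", 0, 0L, 100000L))

    val rdd = KafkaUtils.createRDD[String, String](
      sc, kafkaParams.asJava, offsetRanges, LocationStrategies.PreferConsistent)

    // Keep just the message values and dump them to HDFS.
    rdd.map(_.value).saveAsTextFile("hdfs:///data/my-topic/dump")

    sc.stop()
  }
}
```

Once the job finishes it exits like any other batch job, so there is no long-running streaming context to manage.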

Upvotes: 1
