Reputation: 15435
I'm reading through this blog post:
It discusses using Spark Streaming and Apache Kafka to do some near-real-time processing. I completely understand the article, and it does show how I could use Spark Streaming to read messages from a topic. I would like to know: is there a Spark Streaming API that I can use to write messages into a Kafka topic?
My use case is pretty simple. I have a set of data that I can read from a given source at a constant interval (say, every second). I do this using reactive streams. I would like to do some analytics on this data using Spark. I want fault tolerance, so Kafka comes into play. So what I would essentially do is the following (please correct me if I'm wrong):
One other question, though: is the Streaming API in Spark an implementation of the Reactive Streams specification? Does it have back-pressure handling (Spark Streaming v1.5)?
Upvotes: 4
Views: 2628
Reputation: 57
If you have to write the result stream to another Kafka topic, say 'topic_x', then the result stream you are writing must have columns named 'key' and 'value' ('value' is required by the Kafka sink; 'key' is optional).
result_stream = result_stream.selectExpr('CAST(key AS STRING)', 'CAST(value AS STRING)')

kafkaOutput = result_stream \
    .writeStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', '192.X.X.X:9092') \
    .option('topic', 'topic_x') \
    .option('checkpointLocation', './resultCheckpoint') \
    .start()

kafkaOutput.awaitTermination()
For more details, check the Structured Streaming + Kafka integration guide: https://spark.apache.org/docs/2.4.1/structured-streaming-kafka-integration.html
Upvotes: 0
Reputation: 11985
Spark Streaming is not an implementation of the Reactive Streams specification, but Spark Streaming 1.5 does have internal back-pressure-based dynamic throttling. There's some work to extend that beyond throttling in the pipeline. This throttling is compatible with the Kafka direct stream API.
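The back-pressure mechanism in 1.5 is switched on through a Spark configuration property. A minimal sketch (the app name below is just a placeholder):

```python
from pyspark import SparkConf

# spark.streaming.backpressure.enabled activates the rate controller
# introduced in Spark 1.5; receivers/direct streams then adapt their
# ingestion rate to current processing conditions.
conf = (SparkConf()
        .setAppName("kafka-streaming-demo")  # placeholder name
        .set("spark.streaming.backpressure.enabled", "true"))
```

The resulting `conf` is then passed to the `StreamingContext` as usual.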
You can write to Kafka in a Spark Streaming application, here's one example.
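Since Spark Streaming (1.x) has no built-in Kafka sink, the usual pattern is `foreachRDD` plus `foreachPartition` with an ordinary Kafka client. A minimal sketch, assuming the kafka-python client and placeholder broker/topic names:

```python
def send_partition(records):
    # One producer per partition: KafkaProducer is not serializable,
    # so it must be created on the executor, not the driver.
    from kafka import KafkaProducer  # kafka-python, one possible client
    producer = KafkaProducer(bootstrap_servers='localhost:9092')  # placeholder broker
    for record in records:
        producer.send('output_topic', str(record).encode('utf-8'))  # placeholder topic
    producer.flush()
    producer.close()

# Usage inside a streaming app, where `stream` is an existing DStream:
#   stream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))
```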
(Full disclosure: I'm one of the implementers of some of the back-pressure work)
Upvotes: 6