Reputation: 2714
I have to design a spark streaming application with below use case. I am looking for best possible approach for this.
I have application which pushing data into 1000+ different topics each has different purpose . Spark streaming will receive data from each topic and after processing it will write back to corresponding another topic.
Ex.
Input Type 1 Topic --> Spark Streaming --> Output Type 1 Topic
Input Type 2 Topic --> Spark Streaming --> Output Type 2 Topic
Input Type 3 Topic --> Spark Streaming --> Output Type 3 Topic
.
.
.
Input Type N Topic --> Spark Streaming --> Output Type N Topic and so on.
I need to answer following questions.
Any other issue you guys see here ?
Highly appreicate your response.
Upvotes: 2
Views: 1231
Reputation: 1018
You will have to use the recommended "Direct approach" instead of the Receiver approach, otherwise your application is going to use more than 1000 cores, if you don't have more, it will be able to receive data from your Kafka's topic but not to process them. From Spark Streaming Doc :
Note that, if you want to receive multiple streams of data in parallel in your streaming application, you can create multiple input DStreams (discussed further in the Performance Tuning section). This will create multiple receivers which will simultaneously receive multiple data streams. But note that a Spark worker/executor is a long-running task, hence it occupies one of the cores allocated to the Spark Streaming application.
You can see in the Kafka Integration (there is one for Kafka 0.8 and one for 0.10) doc how to see in which topic belongs a message
If a client add new topics or partitions, you will need to update your Spark Streaming's topics conf, and redeploy it. If you use Kafka 0.10 you can also use RegEx for topics' name, see Consumer Strategies. I've experienced reading from a deleted topic in Kafka 0.8, and there was no problems, still verify ("Trust, but verify")
See Spark Streaming's doc about Fault Tolerance, also use the mode --supervise
when submiting your application to your cluster, see the Deploying documentation for more information
To achieve exactly-once semantic, I suggest this Github from Spark Streaming's main commiter : https://github.com/koeninger/kafka-exactly-once
Bonus, good similar StackOverFlow post : Spark: processing multiple kafka topic in parallel
Bonus2: Watch out for the soon-to-be-released Spark 2.2 and the Structured Streaming component
Upvotes: 5