Reputation: 2661
I came across the following code, which processes messages in Spark Streaming:
val listRDD = ssc.socketTextStream(host, port)
listRDD.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    // Should I start a separate thread for each RDD and/or Partition?
    partition.foreach(message => {
      Processor.processMessage(message)
    })
  })
})
This is working for me, but I am not sure it is the best way. I understand that a DStream consists of one or more RDDs, but this code processes the RDDs sequentially, one after the other, right? Isn't there a better way, some method or function, I can use so that all the RDDs in the DStream get processed in parallel? Should I start a separate thread for each RDD and/or partition? Have I misunderstood how this code works under Spark?
I suspect this code is not taking full advantage of Spark's parallelism.
Upvotes: 0
Views: 1229
Reputation: 5999
Streams are partitioned into small RDDs for convenience and efficiency (see "micro-batching"). But you really don't need to break every RDD into partitions yourself, or even break the stream into RDDs.
It all depends on what Processor.processMessage really is. If it is a pure transformation function, you can just do listRDD.map(Processor.processMessage) and you get a stream of whatever the result of processing a message is, computed in parallel with no need for you to do much else.
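For concreteness, here is a minimal sketch of that approach, assuming Processor.processMessage is a pure String => String function (the print() at the end is just an output action to force the computation):

val results = listRDD.map(Processor.processMessage)
// Spark applies the function across the partitions of each
// micro-batch RDD, so messages are processed in parallel
// without any explicit threading on your side.
results.print()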
If Processor is a mutable object that holds state (say, counting the number of messages), then things are more complicated, as you will need to define many such objects to account for parallelism and will also need to somehow merge the results later on.
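For simple commutative state such as a count, one way to do that merging (a sketch only, assuming Spark 2.x; the accumulator name here is made up) is a Spark accumulator, which merges the per-task updates for you:

val messageCount = ssc.sparkContext.longAccumulator("processedMessages")

listRDD.foreachRDD { rdd =>
  rdd.foreach { message =>
    Processor.processMessage(message)
    messageCount.add(1) // each task updates its own copy; Spark merges them
  }
  // Runs on the driver once the batch's job has completed.
  println(s"Messages processed so far: ${messageCount.value}")
}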
Upvotes: 2