DilTeam

Reputation: 2661

Processing RDDs in a DStream in parallel

I came across the following code which processes messages in Spark Streaming:

val listRDD = ssc.socketTextStream(host, port)
listRDD.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    // Should I start a separate thread for each RDD and/or Partition?
    partition.foreach(message => {
      Processor.processMessage(message)
    })
  })
})

This works for me, but I am not sure it is the best way. I understand that a DStream consists of one or more RDDs, but this code processes the RDDs sequentially, one after the other, right? Is there a better way, some method or function, so that all the RDDs in the DStream get processed in parallel? Should I start a separate thread for each RDD and/or partition? Have I misunderstood how this code works in Spark?

Somehow I think this code is not taking full advantage of Spark's parallelism.

Upvotes: 0

Views: 1229

Answers (1)

Daniel Langdon

Reputation: 5999

Streams are partitioned into small RDDs for convenience and efficiency (look up micro-batching). But you really don't need to break every RDD into partitions yourself, or even break the stream into RDDs.

It all depends on what Processor.processMessage really is. If it is a single transformation function, you can just do listRDD.map(Processor.processMessage) and you get a stream of whatever the result of processing a message is, computed in parallel with no need for you to do much else.
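
For example, here is a minimal sketch of that approach. It assumes Processor.processMessage is a pure String => String function (the toUpperCase body, app name, host, port, and batch interval are all placeholder assumptions for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical stateless processor: a pure String => String function
// standing in for whatever Processor.processMessage really does.
object Processor {
  def processMessage(message: String): String = message.toUpperCase
}

object StreamApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ProcessStream")
    val ssc  = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)

    // map runs on every partition of every micro-batch RDD in parallel
    // across the executors; no explicit foreachRDD loop is needed.
    val processed = lines.map(Processor.processMessage)
    processed.print()

    ssc.start()
    ssc.awaitTermination()
  }
}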

If Processor is a mutable object that holds state (say, counting the number of messages) then things are more complicated, as you will need to define many such objects to account for parallelism and will also need to somehow merge results later on.
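
As a hedged sketch of that second case, reusing the lines DStream from the example above: instead of one shared mutable Processor, keep a local counter per partition and merge the partial results afterwards. The counting logic is purely illustrative; for plain counting, DStream.count() already does exactly this for you.

// One local count per partition of each micro-batch, merged per batch.
// This avoids sharing a single mutable object across parallel tasks.
val perPartitionCounts = lines.mapPartitions { partition =>
  Iterator(partition.size.toLong)  // local "state" for this partition
}
val totalPerBatch = perPartitionCounts.reduce(_ + _)  // merge step
totalPerBatch.print()              // one total per micro-batch

For state that must survive across batches (say, a running total), Spark Streaming provides updateStateByKey, which handles the merging for you.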

Upvotes: 2
