terminatur

Reputation: 688

Spark Streaming Union Stream - parallelization

This may be a basic question, but I am having some trouble understanding this.

I am currently using Microsoft Azure Event Hubs (which is similar to Kafka) as the streaming source in my Spark/Scala application.

If I create a unioned stream, which I believe unions multiple DStream objects so they are abstracted to look like a single DStream, will the multiple RDDs in the stream be processed in parallel, or will each RDD be processed one at a time?

To try and explain this more, here is a quick example:

sparkConf.set(SparkArgumentKeys.MaxCores, (partitionCount * 2).toString)

val ssc = new StreamingContext(sparkConf, streamDuration)

val stream = EventHubsUtils.createUnionStream(ssc, hubParams, storageLevel)
stream.checkpoint(streamDuration)

val strings = stream.map(f => new String(f))
strings.foreachRDD(rdd => {
  rdd.map(f => f.split(' '))
})

partitionCount is the number of partitions in the Azure Event Hub.

  1. Does the initial "stream.map" perform on each RDD in parallel?
  2. Does "strings.foreachRDD" process a single RDD at a time, or does it process all the RDDs in some parallel manner?

Upvotes: 0

Views: 2041

Answers (1)

T. Gawęda

Reputation: 16076

After each batch interval, i.e. after streamDuration, Spark collects all the data received in that time window into one RDD, then maps over that RDD. Again: it is one RDD, but the map runs in parallel across its partitions, just like a map in a batch job.

As the last step, the function you passed to foreachRDD is executed for each RDD. "For each RDD" means it is executed once on the RDD produced by each micro-batch (time window).

Of course, after the next streamDuration elapses, data is collected again (only the data received between the last collection and now), a new RDD is created, the map runs, and the function given to foreachRDD is invoked again.

Summary: foreachRDD doesn't mean that many RDDs are executed at one time; it means that in every micro-batch the function is applied to that micro-batch's single RDD.
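To make this concrete, here is a minimal sketch based on the question's code (sparkConf, hubParams, storageLevel, and streamDuration are assumed to be defined as in the question). One caveat worth noting: map alone is lazy, so inside foreachRDD you need an action (count, collect, saveAsTextFile, ...) for any work to actually happen:

```scala
import org.apache.spark.streaming.StreamingContext

val ssc = new StreamingContext(sparkConf, streamDuration)

// One micro-batch = one RDD; the union stream presents all
// Event Hubs partitions as a single DStream.
val stream  = EventHubsUtils.createUnionStream(ssc, hubParams, storageLevel)
stream.checkpoint(streamDuration)

// Runs in parallel across the RDD's partitions within each batch.
val strings = stream.map(bytes => new String(bytes))

strings.foreachRDD { rdd =>
  // Invoked once per micro-batch, with that batch's single RDD.
  val words = rdd.flatMap(_.split(' '))
  // count() is an action, so it triggers the (parallel) computation.
  println(s"words in this batch: ${words.count()}")
}

ssc.start()
ssc.awaitTermination()
```

So both questions have the same answer: within one micro-batch, map and the transformations inside foreachRDD run in parallel across partitions, but micro-batches themselves are handled one RDD at a time.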

Upvotes: 3
