Reputation: 688
This may be a basic question, but I am having some trouble understanding this.
I am currently using Microsoft Azure Event Hubs streaming (similar to Kafka) in my Spark/Scala application.
If I create a union stream (which, as I understand it, unions multiple DStream objects so they appear as a single DStream), will the multiple RDDs in the stream be processed in parallel, or will each RDD be processed individually?
To try and explain this more, here is a quick example:
sparkConf.set(SparkArgumentKeys.MaxCores, (partitionCount * 2).toString)

val ssc = new StreamingContext(sparkConf, streamDuration)
val stream = EventHubsUtils.createUnionStream(ssc, hubParams, storageLevel)
stream.checkpoint(streamDuration)

val strings = stream.map(f => new String(f))
strings.foreachRDD(rdd => {
  // Note: map is lazy; without an action (e.g. count, collect, saveAsTextFile)
  // this transformation never actually runs.
  rdd.map(f => f.split(' '))
})
partitionCount is the number of partitions in the Azure Event Hub.
Upvotes: 0
Views: 2041
Reputation: 16076
After each batch interval, i.e. after streamDuration, Spark collects all the data received in that time window into one RDD, then maps that RDD (again: it is one RDD, but the map runs in parallel across its partitions, just like map in a batch job).

As the last step, the function you pass to foreachRDD is executed for that RDD. "For each RDD" means it is executed once on the RDD of each micro-batch (time window).
Of course, after the next streamDuration elapses, the data is collected again, a new RDD is created (containing only the data received since the last collection), map runs, and the function given to foreachRDD is applied once more.
Summary: foreachRDD does not mean that many RDDs are processed at one time; it means that in every micro-batch the function is applied to that micro-batch's single RDD.
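To make the per-micro-batch behavior concrete, here is a minimal plain-Scala sketch (no Spark, example batch contents are made up) of the model described above: each window's received data forms one batch, and the foreachRDD body runs once per batch, while the inner map is the per-record transformation that Spark would execute in parallel.

```scala
// Each inner Seq stands in for the single RDD built from one micro-batch.
val batches: Seq[Seq[String]] = Seq(
  Seq("a b", "c d"), // data received during the first streamDuration window
  Seq("e f")         // data received during the next window
)

// Analogue of strings.foreachRDD(rdd => rdd.map(_.split(' '))):
// the outer map plays the role of "once per micro-batch", the inner map
// is the per-record work (which Spark would distribute across partitions).
val results: Seq[Seq[Seq[String]]] =
  batches.map(batch => batch.map(_.split(' ').toSeq))

println(results) // one result collection per micro-batch
```

The point of the sketch: there are never "many RDDs at once" inside foreachRDD; there is exactly one batch in flight per invocation, and parallelism comes from distributing the records within that batch.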
Upvotes: 3