louis lau

Reputation: 161

Confusion with Spark streaming multiple input kafka dstreams

I'm new to Spark Streaming. I don't know the difference between the codes below:

A:

val kafkaDStreams = (1 to 3).map { i =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams,
    topicsMap, StorageLevel.MEMORY_AND_DISK_SER)
    .map(_._2)
}
ssc.union(kafkaDStreams).foreachRDD(......)

B:

val kafkaDStreams = (1 to 3).map { i =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams,
    topicsMap, StorageLevel.MEMORY_AND_DISK_SER)
    .map(_._2)
    .foreachRDD(......)
}

What is the difference between these two code samples when executed in a Spark Streaming app? Any help? Thanks!

Upvotes: 1

Views: 757

Answers (1)

Philip Kendall

Reputation: 4314

I'll do this backwards, because it's probably easier to explain that way :-)

In the second example, you create three DStreams via (1 to 3).map { [...] createStream [...] } and then call foreachRDD on each of them, so you have three separate sets of processing going on in parallel. Your foreachRDD function is therefore called three times for every batch interval you set on your Spark streaming context - i.e. in the first interval, you'll get one call to foreachRDD for stream 1, one for stream 2 and one for stream 3.

In the first example, you create the same three DStreams, but then call union on them to produce just one DStream with the elements from all three. This means that you get just one call to your foreachRDD function for each batch interval, but the RDD it receives now contains the elements from all of stream 1, stream 2 and stream 3.
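The difference in call counts can be sketched with a plain-Scala analogy, no Spark required - three sequences stand in for the three DStreams, and a counter stands in for the foreachRDD body (all names here are illustrative, not Spark API):

```scala
// Three "streams", each contributing two elements in a batch interval.
val streams = (1 to 3).map(i => Seq(s"stream$i-a", s"stream$i-b"))

// Example B analogy: foreachRDD is invoked once per stream, per interval.
var callsB = 0
streams.foreach { rdd => callsB += 1 } // processing body runs 3 times

// Example A analogy: union first, then one invocation sees everything.
var callsA = 0
val unioned = streams.reduce(_ ++ _)
Seq(unioned).foreach { rdd => callsA += 1 } // processing body runs once

println(s"B: $callsB calls; A: $callsA call over ${unioned.size} elements")
```

Either way all six elements get processed; what changes is whether your processing logic runs as three smaller invocations or one combined invocation per interval.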

Upvotes: 2

Related Questions