Reputation: 404
How do I leverage reduceByKey in Spark / Spark Streaming for a normal Scala collection that resides inside a DStream?
I have a DStream[(String, Array[(String, List)])] and I want to apply a reduceByKey operation to the inner Array[(String, List)], joining all the lists for a given key together.
I am able to do this in normal Spark by collecting the outer RDD to a plain Array (to avoid a serialization error on the SparkContext object), then running a foreach over it and calling sc.parallelize() on each inner Array[(String, List)]. But since a DStream has no direct conversion to a plain array, I can't apply sc.parallelize() to the inner component, and so I can't use the reduceByKey function.
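For reference, my batch-Spark workaround looks roughly like the sketch below (the variable names and the String element type for the lists are just placeholders, not from my actual job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on older Spark versions)

val sc = new SparkContext(new SparkConf().setAppName("inner-reduce").setMaster("local[*]"))

// Assumed shape of my data, with String standing in for the real list element type
val outerRdd = sc.parallelize(Seq(
  ("outerKey", Array(("a", List("x")), ("a", List("y")), ("b", List("z"))))
))

// Pull the outer RDD to the driver (avoids the serialization error on the SparkContext),
// then parallelize each inner array and merge the lists per key
outerRdd.collect().foreach { case (outerKey, inner) =>
  val merged = sc.parallelize(inner.toSeq).reduceByKey(_ ::: _).collect()
  println(s"$outerKey -> ${merged.mkString(", ")}")
}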
I'm very new to Spark and Spark Streaming (and to the whole map-reduce concept, actually), so this might not be the right way to do it. If anyone can suggest a better practice, please do.
Upvotes: 0
Views: 1811
Reputation: 326
This is an old question, so hopefully you figured this out already, but... in order to perform reduceByKey (and related) operations on a DStream you must first import the StreamingContext implicits:
import org.apache.spark.streaming.StreamingContext._
This provides implicit methods extending DStream. Once you've done this, not only can you perform the stock reduceByKey, you can also use time-sliced functions like:
reduceByKeyAndWindow((a: List, b: List) => (a ::: b), Seconds(30), Seconds(30))
These can be quite useful if you want to do aggregation within a sliding window. Hope that helps!
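For completeness, here's what that could look like end to end on a structure shaped like the one in the question. This is a rough sketch, assuming Spark 1.x, a queue-based test input, and String as the element type of the lists; it first flattens the inner arrays so the inner (key, list) pairs become the stream's own pairs, then reduces them:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // the implicit pair-DStream methods
import scala.collection.mutable

val sc  = new SparkContext(new SparkConf().setAppName("dstream-reduce").setMaster("local[2]"))
val ssc = new StreamingContext(sc, Seconds(10))

// Test input matching the question's shape: (String, Array[(String, List[String])])
val queue = mutable.Queue(sc.parallelize(Seq(
  ("outerKey", Array(("a", List("x")), ("a", List("y")), ("b", List("z"))))
)))
val nested = ssc.queueStream(queue)

// Flatten the inner arrays so the inner (key, list) pairs become the stream's own pairs,
// then reduceByKey (from the implicits) concatenates the lists per key in every batch
val perBatch = nested.flatMap { case (_, inner) => inner }.reduceByKey(_ ::: _)

// The same aggregation over a 30-second sliding window
val windowed = nested
  .flatMap { case (_, inner) => inner }
  .reduceByKeyAndWindow((a: List[String], b: List[String]) => a ::: b, Seconds(30), Seconds(30))

perBatch.print()
windowed.print()
ssc.start()
ssc.awaitTermination()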
Upvotes: 0