Reputation: 458
I am working with windowed DStreams in which each DStream contains 3 RDDs with the following keys:
a,b,c
b,c,d
c,d,e
d,e,f
I want to get only the unique keys across all the DStreams:
a,b,c,d,e,f
How can I do this in Spark Streaming?
Upvotes: 1
Views: 1559
Reputation: 37435
We could use a window of 4 intervals to keep a count of recently seen keys and use that count to remove duplicates in the current interval. Applied to the example intervals above, the first interval would emit a,b,c, the next only d, then e, then f, giving a,b,c,d,e,f overall.
Something along these lines:
// imports needed by the snippet below
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// original dstream (placeholder for your actual source)
val dstream: DStream[String] = ???
// make the keys distinct within a single interval and pair them with 1's for counting
val keyedDstream = dstream.transform(rdd => rdd.distinct()).map(e => (e, 1))
// keep a window of 4 intervals with the count of distinct keys we have seen
val windowed = keyedDstream.reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(4), Seconds(1))
// join the windowed count with the initially keyed dstream
val joined = keyedDstream.join(windowed)
// the unique keys through the window are those with a windowed count of 1 (only seen in the current interval)
val uniquesThroughWindow = joined.transform { rdd =>
  rdd.collect { case (k, (_, windowCount)) if windowCount == 1 => k }
}
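For completeness, here is a minimal sketch of how this could be wired into a StreamingContext. The local master, the 1-second batch interval, and the socket source on localhost:9999 are assumptions for illustration, not part of the original answer:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// assumptions: local run, 1-second batches, one key per line arriving on a socket
val conf = new SparkConf().setAppName("unique-keys-in-window").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
val dstream = ssc.socketTextStream("localhost", 9999)

// same pipeline as above
val keyedDstream = dstream.transform(rdd => rdd.distinct()).map(e => (e, 1))
val windowed = keyedDstream.reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(4), Seconds(1))
val joined = keyedDstream.join(windowed)
val uniquesThroughWindow = joined.transform { rdd =>
  rdd.collect { case (k, (_, windowCount)) if windowCount == 1 => k }
}

// print the keys that were not seen in the earlier intervals of the window
uniquesThroughWindow.print()

ssc.start()
ssc.awaitTermination()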
Upvotes: 2