Reputation: 458
I am working with windowed DStreams in which each DStream contains 3 RDDs with the following keys:
a,b,c
b,c,d
c,d,e
d,e,f
I want to get only the unique keys across all the DStreams:
a,b,c,d,e,f
How can I do this in Spark Streaming?
Upvotes: 1
Views: 1559
Reputation: 37435
We could use a window of 4 intervals to keep a count of recently seen keys and use that count to remove duplicates in the current interval. Applied to the example intervals above, the first interval would emit a,b,c, the next only d, then e, then f, giving a,b,c,d,e,f overall.
Something along these lines:
// imports needed by the snippet below
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// original dstream (placeholder for your actual source)
val dstream: DStream[String] = ???
// make the keys distinct within a single interval and pair them with 1's for counting
val keyedDstream = dstream.transform(rdd => rdd.distinct()).map(e => (e, 1))
// keep a window of 4 intervals with the count of distinct keys we have seen
val windowed = keyedDstream.reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(4), Seconds(1))
// join the windowed count with the initially keyed dstream
val joined = keyedDstream.join(windowed)
// the unique keys through the window are those with a windowed count of 1 (only seen in the current interval)
val uniquesThroughWindow = joined.transform { rdd =>
  rdd.collect { case (k, (_, windowCount)) if windowCount == 1 => k }
}
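For completeness, here is a minimal sketch of how this could be wired into a StreamingContext. The local master, the 1-second batch interval, and the socket source on localhost:9999 are assumptions for illustration, not part of the original answer:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// assumptions: local run, 1-second batches, one key per line arriving on a socket
val conf = new SparkConf().setAppName("unique-keys-in-window").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
val dstream = ssc.socketTextStream("localhost", 9999)

// same pipeline as above
val keyedDstream = dstream.transform(rdd => rdd.distinct()).map(e => (e, 1))
val windowed = keyedDstream.reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(4), Seconds(1))
val joined = keyedDstream.join(windowed)
val uniquesThroughWindow = joined.transform { rdd =>
  rdd.collect { case (k, (_, windowCount)) if windowCount == 1 => k }
}

// print the keys that were not seen in the earlier intervals of the window
uniquesThroughWindow.print()

ssc.start()
ssc.awaitTermination()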
Upvotes: 2