Reputation: 11
We can use window to create a transformed DStream with larger batches:
streamIDs.window(Duration(1000)).foreachRDD(rdd => println(rdd.distinct().count()))
Is there a way to do the same over a sliding window, with the slide duration also passed as a parameter?
Upvotes: 1
Views: 899
Reputation: 1676
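Are you trying to filter out duplicates within the window? If so, you can emulate a distinct over a window by mapping each element to a key-value pair in which the key is the original element and the value is an unimportant placeholder (say, 1), and then reducing by key over the window:
streamIDs
  .map(s => (s, 1))                                             // key = element, value = placeholder
  .reduceByKeyAndWindow((a: Int, b: Int) => a, Duration(1000))  // keep a single entry per key
  .map(_._1)                                                    // drop the placeholder
This creates a DStream of the distinct values in each sliding window. To control how often the window advances, pass a slide duration as a further argument after the window duration; both window and reduceByKeyAndWindow accept one.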
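To make the window and slide durations explicit parameters, the distinct-over-a-window idea can be wrapped in small helpers. This is a minimal Scala sketch, assuming Spark Streaming is on the classpath; the object name SlidingDistinct and both method names are hypothetical, chosen only for illustration. On Spark versions before 1.3 the pair operations may additionally need import org.apache.spark.streaming.StreamingContext._.

```scala
import scala.reflect.ClassTag
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper object, not part of Spark.
object SlidingDistinct {

  // Distinct elements seen in each window. Both `window` and `slide` must be
  // multiples of the batch interval of the underlying StreamingContext.
  def distinctPerWindow[T: ClassTag](
      ids: DStream[T], window: Duration, slide: Duration): DStream[T] =
    ids
      .map(id => (id, 1))                                          // value is a placeholder
      .reduceByKeyAndWindow((a: Int, _: Int) => a, window, slide)  // one entry per key
      .map(_._1)                                                   // keep only the key

  // Number of distinct elements in each window, i.e. the quantity the
  // question prints with rdd.distinct().count().
  def distinctCountPerWindow[T: ClassTag](
      ids: DStream[T], window: Duration, slide: Duration): DStream[Long] =
    ids.window(window, slide).transform(rdd => rdd.distinct()).count()
}
```

Assuming streamIDs is an existing DStream, a call such as SlidingDistinct.distinctCountPerWindow(streamIDs, Seconds(10), Seconds(2)).print() would print the distinct count of the last 10 seconds every 2 seconds (the two durations here are arbitrary example values).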
If you also want the count of each value in each window, do the following:
streamIDs
  .map(s => (s, 1))                                                 // one occurrence per element
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Duration(1000))  // sum occurrences per key
This creates a DStream of (value, count) pairs, e.g. ("A", 3), ("B", 5), ...
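To see what the per-window counts look like for a given window length and slide, here is a plain-Scala model that needs no Spark at all. It is only a sketch: batches are modelled as an in-memory Seq, window and slide are measured in batches rather than time, and the sample data and the names WindowModel and countsPerWindow are made up for illustration. Unlike a real stream, the model emits only full windows.

```scala
object WindowModel {
  // Each inner Seq stands for one batch of IDs (hypothetical sample data).
  val batches: Seq[Seq[String]] = Seq(
    Seq("A", "B", "A"),
    Seq("B", "C"),
    Seq("A"),
    Seq("C", "C")
  )

  // Occurrences of each value in every window of `windowLen` batches,
  // advancing `slide` batches at a time.
  def countsPerWindow(
      bs: Seq[Seq[String]], windowLen: Int, slide: Int): List[Map[String, Int]] =
    bs.sliding(windowLen, slide)
      .map { w =>
        w.flatten.groupBy(identity).map { case (id, occ) => (id, occ.size) }
      }
      .toList

  def main(args: Array[String]): Unit =
    countsPerWindow(batches, windowLen = 2, slide = 1).foreach(println)
}
```

With a window of 2 batches sliding by 1, the sample data yields three windows: A and B twice plus C once, then A, B and C once each, then A once and C twice. The number of keys in each map is the distinct count for that window.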
Upvotes: 1