Honey

Reputation: 11

Spark streaming - sliding window and use of distinct

We can use window to create a transformed DStream with bigger batches:

streamIDs.window(Duration(1000)).foreachRDD(rdd => println(rdd.distinct().count())) 

Is there a way to do the same over a sliding window, with the slide duration passed as a parameter as well?

Upvotes: 1

Views: 899

Answers (1)

Daniel

Reputation: 1676

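Are you trying to filter out duplicates within the window? If so, you can emulate a distinct over a window by mapping each element to a key-value pair whose key is the element itself and whose value is an arbitrary placeholder (null would do; 1 is used here), then reducing by key over the window:

streamIDs
    .map(id => (id, 1))  // the value is only a placeholder
    // keep either value; an optional third Duration argument sets the slide duration
    .reduceByKeyAndWindow((a: Int, _: Int) => a, Duration(1000))
    .map(_._1)  // drop the placeholder, leaving the distinct keys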

This will create a DStream of distinct values from each sliding window.
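As for the slide duration itself: the window call from the question also has a two-argument form, window(windowDuration, slideDuration), so the original one-liner can slide as well. Below is a minimal sketch of that, not a definitive implementation. The object name, the queue-backed test input, the local master and the concrete durations are all assumptions made for illustration; both durations must be multiples of the batch interval.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical driver object; the name is illustrative only.
object SlidingDistinctCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SlidingDistinctCount")
    val ssc = new StreamingContext(conf, Seconds(1)) // batch interval: 1 second

    // Made-up input: each RDD in the queue is consumed as one batch.
    val batches = mutable.Queue[RDD[String]](
      ssc.sparkContext.parallelize(Seq("A", "B", "A")),
      ssc.sparkContext.parallelize(Seq("B", "C")),
      ssc.sparkContext.parallelize(Seq("C", "C", "D"))
    )
    val streamIDs = ssc.queueStream(batches, oneAtATime = true)

    // Assumed values; both must be multiples of the batch interval.
    val windowDuration = Seconds(2)
    val slideDuration = Seconds(1)

    // Same logic as in the question, with the slide duration as a parameter.
    streamIDs
      .window(windowDuration, slideDuration)
      .foreachRDD(rdd => println(rdd.distinct().count()))

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few windows fire, then shut down
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Here distinct() runs on each windowed RDD as a whole, whereas the reduceByKeyAndWindow variant gives you the distinct values as a DStream that you can keep transforming.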

In case you also want the count of each value in each window, do the following:

streamIDs
    .map(id => (id, 1))  // one occurrence per element
    .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Duration(1000))  // sum the counts per key

This will create a DStream of (value, count) pairs, e.g. ('A', 3), ('B', 5), ...
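To see what the windowed count computes without starting Spark, here is a small plain-Scala model. It is only a sketch: WindowModel and windowCounts are hypothetical names, the sample batches are made up, and window and slide are measured in batches rather than milliseconds. It also yields only full windows, while Spark additionally emits shorter windows while the stream is warming up.

```scala
// Plain-Scala model of a sliding-window count; no Spark involved.
object WindowModel {
  // Each inner Seq stands for one batch of IDs.
  // windowLen and slideLen are measured in batches.
  def windowCounts(
      batches: Seq[Seq[String]],
      windowLen: Int,
      slideLen: Int
  ): List[Map[String, Int]] =
    batches
      .sliding(windowLen, slideLen)            // group batches into windows
      .map(_.flatten                           // all IDs seen in this window
        .groupBy(identity)                     // bucket equal IDs together
        .map { case (id, hits) => id -> hits.size }) // count per ID
      .toList

  def main(args: Array[String]): Unit = {
    // Made-up sample data: four batches.
    val batches = Seq(Seq("A", "B", "A"), Seq("B", "C"), Seq("C", "C", "D"), Seq("A"))
    windowCounts(batches, windowLen = 2, slideLen = 1).foreach(println)
  }
}
```

The keys of each map are the distinct values for that window, so the distinct-only variant is the same computation with the counts thrown away.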

Upvotes: 1
