user3240688

Reputation: 1327

Spark Streaming - DStream does not have distinct()

I want to count the distinct value of some type of IDs represented as an RDD.

In the non-streaming case, it's fairly straightforward. Say IDs is an RDD of IDs read from a flat file.

    print ("number of unique IDs %d" %  (IDs.distinct().count()))

But I can't seem to do the same thing in the streaming case. Say streamIDs is a DStream of IDs read from the network.

    print ("number of unique IDs from stream %d" %  (streamIDs.distinct().count()))

This gives me the following error:

AttributeError: 'TransformedDStream' object has no attribute 'distinct'

What am I doing wrong? How do I print out the number of distinct IDs that showed up during this batch?

Upvotes: 3

Views: 3752

Answers (2)

user1384205

Reputation: 1291

Have you tried using transform, which applies an RDD-to-RDD function to each micro batch:

    yourDStream.transform(r => r.distinct())

Note that transform is lazy; you still need an output operation (such as print or foreachRDD) on the resulting DStream for anything to happen.

Upvotes: -1

juanrh0011

Reputation: 323

With RDDs you have a single result, but with a DStream you have a series of results, one per micro batch. So you cannot print the number of unique IDs just once; instead, you register an output action that prints the unique IDs for each micro batch. Each micro batch is an RDD, on which you can use distinct():

streamIDs.foreachRDD(rdd => println(rdd.distinct().count()))

Remember that you can use window to create a transformed DStream whose batches span a longer interval:

streamIDs.window(Duration(1000)).foreachRDD(rdd => println(rdd.distinct().count()))

Upvotes: 9
