Reputation: 21
I was working with an Apache log file, and I created an RDD of (day, host) tuples, one per log line. The next step was to group by host and then display the result.
I mapped the first RDD into (day, host) tuples and then called distinct() on it. When I don't use distinct() I get a different result than when I do. So how does the result change when using distinct() in Spark?
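To make the question concrete, here is a plain-Python analogue of the pipeline described above. The log lines and the toy parser are invented for illustration; in a real job you would use `sc.textFile(...)`, `RDD.map(...)`, and `RDD.distinct()`, but `set()` models what distinct() does to the tuples:

```python
# Plain-Python sketch of the (day, host) pipeline; log lines are invented.
log_lines = [
    "in24.inetnebr.com - [01/Aug/1995:00:00:01] GET /a.html",
    "in24.inetnebr.com - [01/Aug/1995:00:05:12] GET /b.html",  # same day, same host
    "uplherc.upl.com - [01/Aug/1995:00:00:07] GET /c.html",
]

def day_host(line):
    # Toy parser: host is the first field, day is the "01/Aug/1995" part.
    host = line.split()[0]
    day = line.split("[")[1].split(":")[0]
    return (day, host)

pairs = [day_host(line) for line in log_lines]   # like rdd.map(day_host)
unique_pairs = set(pairs)                        # like rdd.distinct()

print(len(pairs))         # 3 -- one tuple per log line
print(len(unique_pairs))  # 2 -- duplicate (day, host) pairs collapse to one
```

The two counts differ because two log lines came from the same host on the same day, so they produce identical tuples and distinct() keeps only one of them.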
Upvotes: 0
Views: 184
Reputation: 41
distinct() removes duplicate entries from the RDD, so your count should decrease or stay the same after applying it.
http://spark.apache.org/docs/0.7.3/api/pyspark/pyspark.rdd.RDD-class.html#distinct
Upvotes: 1
Reputation: 1092
When you only apply map to FIRST_RDD (the logs) to produce SECOND_RDD, the count of SECOND_RDD will equal the count of FIRST_RDD, because map transforms each element one-to-one.
But if you then call distinct() on SECOND_RDD, the count will drop to the number of distinct tuples present in SECOND_RDD.
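The two counts can be sketched in a few lines of plain Python. The data is invented for the example, and `set()` stands in for RDD.distinct():

```python
# Invented miniature of the scenario: SECOND_RDD = FIRST_RDD.map(parse).
first_rdd = [
    "01/Aug hostA ...",
    "01/Aug hostA ...",   # duplicate (day, host) pair
    "01/Aug hostB ...",
    "02/Aug hostA ...",
]
second_rdd = [tuple(line.split()[:2]) for line in first_rdd]  # like .map(...)

# map is one-to-one, so the counts match:
print(len(first_rdd), len(second_rdd))   # 4 4

distinct_rdd = set(second_rdd)           # like SECOND_RDD.distinct()
print(len(distinct_rdd))                 # 3 -- the duplicate tuple is gone
```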
Upvotes: 0