Reputation: 21
I was working with an Apache log file, and I created an RDD of (day, host) tuples, one per log line. The next step was to group by host and then display the result.
I mapped the first RDD into (day, host) tuples and then called distinct() on it. When I don't use distinct() I get a different result than when I do. So how does the result change when using distinct() in Spark?
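To make the question concrete, here is a plain-Python analogue of the pipeline described above. The log lines and the toy parser are invented for illustration; in a real job you would use `sc.textFile(...)`, `RDD.map(...)`, and `RDD.distinct()`, but `set()` models what distinct() does to the tuples:

```python
# Plain-Python sketch of the (day, host) pipeline; log lines are invented.
log_lines = [
    "in24.inetnebr.com - [01/Aug/1995:00:00:01] GET /a.html",
    "in24.inetnebr.com - [01/Aug/1995:00:05:12] GET /b.html",  # same day, same host
    "uplherc.upl.com - [01/Aug/1995:00:00:07] GET /c.html",
]

def day_host(line):
    # Toy parser: host is the first field, day is the "01/Aug/1995" part.
    host = line.split()[0]
    day = line.split("[")[1].split(":")[0]
    return (day, host)

pairs = [day_host(line) for line in log_lines]   # like rdd.map(day_host)
unique_pairs = set(pairs)                        # like rdd.distinct()

print(len(pairs))         # 3 -- one tuple per log line
print(len(unique_pairs))  # 2 -- duplicate (day, host) pairs collapse to one
```

The two counts differ because two log lines came from the same host on the same day, so they produce identical tuples and distinct() keeps only one of them.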
Upvotes: 0
Views: 184
Reputation: 41
distinct() removes duplicate entries from the RDD, so your count should decrease or stay the same after applying it.
http://spark.apache.org/docs/0.7.3/api/pyspark/pyspark.rdd.RDD-class.html#distinct
Upvotes: 1
Reputation: 1092
When you only apply map to FIRST_RDD (the logs) to produce SECOND_RDD, the count of SECOND_RDD will equal the count of FIRST_RDD, because map transforms each element one-to-one.
But if you then call distinct() on SECOND_RDD, the count will drop to the number of distinct tuples present in SECOND_RDD.
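The two counts can be sketched in a few lines of plain Python. The data is invented for the example, and `set()` stands in for RDD.distinct():

```python
# Invented miniature of the scenario: SECOND_RDD = FIRST_RDD.map(parse).
first_rdd = [
    "01/Aug hostA ...",
    "01/Aug hostA ...",   # duplicate (day, host) pair
    "01/Aug hostB ...",
    "02/Aug hostA ...",
]
second_rdd = [tuple(line.split()[:2]) for line in first_rdd]  # like .map(...)

# map is one-to-one, so the counts match:
print(len(first_rdd), len(second_rdd))   # 4 4

distinct_rdd = set(second_rdd)           # like SECOND_RDD.distinct()
print(len(distinct_rdd))                 # 3 -- the duplicate tuple is gone
```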
Upvotes: 0