Reputation: 35
I just want to find the average of all the values associated with a particular key. Below is my program:
from pyspark import SparkContext,SparkConf
conf = SparkConf().setAppName("averages").setMaster("local")
sc = SparkContext(conf=conf)
file_rdd = sc.textFile("C:\spark_programs\python programs\input")
vals_rdd = file_rdd.map(lambda x:(x.split(" ")[0],int(x.split(" ")[2])))
print type(vals_rdd)
pairs_rdd = vals_rdd.reduceByKey(lambda x,y:(x+y)/2)
for line in pairs_rdd.collect():
    print line
The following is the input data:
a hyd 2
b hyd 2
c blr 3
d chn 4
b hyd 5
When I run the program, the output I get is below:
(u'a', 2)
(u'c', 3)
(u'b', 3) -- I can see only b's value got averaged.
(u'd', 4)
Apart from b, none of the values are averaged. Why does this happen? Why aren't the values for a, c and d averaged?
Upvotes: 0
Views: 150
Reputation: 330063
The reduceByKey documentation says to "Merge the values for each key using an associative and commutative reduce function."
The function you pass doesn't satisfy these requirements. In particular, it is not associative:
f = lambda x, y: (x + y) / 2   # results below assume true (float) division
f(1, f(2, 3))
## 1.75
f(f(1, 2), 3)
## 2.25
So it is not applicable in your case and it wouldn't average the values.
As for why the values aren't averaged: apart from the fundamental flaw explained above, there is only one value for each of the remaining keys, so there is no reason to call the merging function at all.
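To see this concretely, here is a minimal sketch (using a small inline subset of your data instead of the input file) showing that reduceByKey only invokes the supplied function for keys with at least two values:
rdd = sc.parallelize([("a", 2), ("b", 2), ("b", 5)])
rdd.reduceByKey(lambda x, y: (x + y) / 2).collect()
## [('a', 2), ('b', 3)]
## 'a' is passed through untouched; only 'b' has two values, so the lambda runs
## once (giving 3 under Python 2 integer division, 3.5 under Python 3)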
As for finding the average of the values associated with each key: just use DataFrames (note that toDF requires an active SparkSession, or SQLContext in older versions):
vals_rdd.toDF().groupBy("_1").avg()
although you can also use aggregateByKey with StatCounter (numerically stable) or map -> reduceByKey -> map (numerically unstable); both are sketched below.
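A rough sketch of those two alternatives, assuming the vals_rdd pair RDD from the question (variable names are mine, and mapValues stands in for the plain map steps):
from pyspark.statcounter import StatCounter

# aggregateByKey with StatCounter: accumulate per-key statistics, then take the mean
stat_means = vals_rdd.aggregateByKey(
    StatCounter(),                            # empty accumulator for each key
    lambda acc, x: acc.merge(x),              # fold a single value into the accumulator
    lambda acc1, acc2: acc1.mergeStats(acc2)  # combine accumulators across partitions
).mapValues(lambda s: s.mean())

# map -> reduceByKey -> map: carry (sum, count) pairs and divide at the end
sum_count_means = (vals_rdd
    .mapValues(lambda x: (x, 1))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    .mapValues(lambda p: float(p[0]) / p[1]))

stat_means.collect()
## [('a', 2.0), ('c', 3.0), ('b', 3.5), ('d', 4.0)]  (order may vary)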
Additionally, I strongly recommend reading the great answers to reduceByKey: How does it work internally?
Upvotes: 1