Reputation: 35
I just want to find the average of all the values associated with a particular key. Below is my program:
from pyspark import SparkContext,SparkConf
conf = SparkConf().setAppName("averages").setMaster("local")
sc = SparkContext(conf=conf)
file_rdd = sc.textFile("C:\spark_programs\python programs\input")
vals_rdd = file_rdd.map(lambda x:(x.split(" ")[0],int(x.split(" ")[2])))
print type(vals_rdd)
pairs_rdd = vals_rdd.reduceByKey(lambda x,y:(x+y)/2)
for line in pairs_rdd.collect():
    print line
The following is the input data:
a hyd 2
b hyd 2
c blr 3
d chn 4
b hyd 5
When I run the program, the output I get is below:
(u'a', 2)
(u'c', 3)
(u'b', 3) -- I can see only b's value got averaged.
(u'd', 4)
Apart from b, none of the values are averaged. Why does this happen? Why aren't the values for a, c and d averaged?
Upvotes: 0
Views: 150
Reputation: 330063
The reduceByKey documentation says to "Merge the values for each key using an associative and commutative reduce function."
The function you pass doesn't satisfy these requirements. In particular, it is not associative:
f = lambda x, y: (x + y) / 2   # results below assume true (float) division
f(1, f(2, 3))
## 1.75
f(f(1, 2), 3)
## 2.25
So it is not applicable in your case and it wouldn't average the values.
As for why the values aren't averaged: apart from the fundamental flaw explained above, there is only one value for each of the remaining keys, so there is no reason to call the merging function at all.
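To see this concretely, here is a minimal sketch (using a small inline subset of your data instead of the input file) showing that reduceByKey only invokes the supplied function for keys with at least two values:
rdd = sc.parallelize([("a", 2), ("b", 2), ("b", 5)])
rdd.reduceByKey(lambda x, y: (x + y) / 2).collect()
## [('a', 2), ('b', 3)]
## 'a' is passed through untouched; only 'b' has two values, so the lambda runs
## once (giving 3 under Python 2 integer division, 3.5 under Python 3)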
As for finding the average of the values associated with each key: just use DataFrames (note that toDF requires an active SparkSession, or SQLContext in older versions):
vals_rdd.toDF().groupBy("_1").avg()
although you can also use aggregateByKey with StatCounter (numerically stable) or map -> reduceByKey -> map (numerically unstable); both are sketched below.
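A rough sketch of those two alternatives, assuming the vals_rdd pair RDD from the question (variable names are mine, and mapValues stands in for the plain map steps):
from pyspark.statcounter import StatCounter

# aggregateByKey with StatCounter: accumulate per-key statistics, then take the mean
stat_means = vals_rdd.aggregateByKey(
    StatCounter(),                            # empty accumulator for each key
    lambda acc, x: acc.merge(x),              # fold a single value into the accumulator
    lambda acc1, acc2: acc1.mergeStats(acc2)  # combine accumulators across partitions
).mapValues(lambda s: s.mean())

# map -> reduceByKey -> map: carry (sum, count) pairs and divide at the end
sum_count_means = (vals_rdd
    .mapValues(lambda x: (x, 1))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    .mapValues(lambda p: float(p[0]) / p[1]))

stat_means.collect()
## [('a', 2.0), ('c', 3.0), ('b', 3.5), ('d', 4.0)]  (order may vary)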
Additionally, I strongly recommend reading the great answers to reduceByKey: How does it work internally?
Upvotes: 1