Reputation: 2520
I'm new to Spark and I'm trying to calculate covariance and correlation. But when I try to compute a sum, Apache Spark shows me an error.
My code:
from pyspark.mllib.stat import Statistics
rddX = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
rddY = sc.parallelize([7,6,5,4,5,6,7,8,9,10])
XY= rddX.zip(rddY)
Statistics.corr(XY)
Meanx = rddX.sum()/rddX.count()
Meany = rddY.sum()/rddY.count()
print(Meanx,Meany)
cov = XY.map(lambda x,y: (x-Meanx)*(y-Meany)).sum()
My error:
opt/ibm/spark/python/pyspark/rdd.py in fold(self, zeroValue, op)
913 # zeroValue provided to each partition is unique from the one provided
914 # to the final reduce call
--> 915 vals = self.mapPartitions(func).collect()
916 return reduce(op, vals, zeroValue)
Why does it show this error, if sum() returns an RDD array?
Upvotes: 0
Views: 450
Reputation: 1940
Try separating the RDD operations:
sumX = rddX.sum()
countX = rddX.count()
meanX = sumX / countX
Upvotes: 1
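Applying that to the whole computation might look like the sketch below (assuming sc is an existing SparkContext, as in the question). Note that zip() produces an RDD of (x, y) tuples, so the lambda passed to map() receives a single tuple argument rather than two separate arguments, which is the likely cause of the traceback:

# Sketch: separate each RDD operation before combining them
rddX = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
rddY = sc.parallelize([7,6,5,4,5,6,7,8,9,10])

meanX = rddX.sum() / rddX.count()
meanY = rddY.sum() / rddY.count()

# zip() yields an RDD of (x, y) tuples, so unpack the tuple inside the lambda
XY = rddX.zip(rddY)
cov = XY.map(lambda xy: (xy[0] - meanX) * (xy[1] - meanY)).sum() / XY.count()
print(cov)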