Daniel ORTIZ

Reputation: 2520

Covariance and Correlation with Spark and Python fails

I'm new to Spark, and I'm trying to calculate a covariance and a correlation. But when I try to compute a sum, Apache Spark shows me an error.

My code:

from pyspark.mllib.stat import Statistics
rddX = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
rddY = sc.parallelize([7,6,5,4,5,6,7,8,9,10])
XY= rddX.zip(rddY)


Statistics.corr(XY)
Meanx = rddX.sum()/rddX.count()
Meany = rddY.sum()/rddY.count()
print(Meanx,Meany)
cov = XY.map(lambda x,y: (x-Meanx)*(y-Meany)).sum()

My error:

opt/ibm/spark/python/pyspark/rdd.py in fold(self, zeroValue, op)
    913         # zeroValue provided to each partition is unique from the one provided
    914         # to the final reduce call
--> 915         vals = self.mapPartitions(func).collect()
    916         return reduce(op, vals, zeroValue)

Why is this error shown, if sum returns a value from the RDD?

Upvotes: 0

Views: 450

Answers (1)

Henrique Branco

Reputation: 1940

Try separating the RDD operations:

sumX = rddX.sum()
countX = rddX.count()
meanX = sumX / countX
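Separating the operations helps with debugging, but the traceback in the question most likely comes from the lambda itself: `XY.map(lambda x,y: ...)` receives each zipped pair as a single tuple, and Python 3 lambdas cannot unpack tuple arguments. A minimal plain-Python sketch of the same computation (no Spark required; the variable names mirror the question's code but are otherwise illustrative):

```python
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [7, 6, 5, 4, 5, 6, 7, 8, 9, 10]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# zip yields ONE tuple per element, so the mapping function must take a
# single argument and index into it -- lambda x, y: ... would fail here,
# just as it does inside RDD.map.
cov = sum((xy[0] - mean_x) * (xy[1] - mean_y) for xy in zip(xs, ys)) / len(xs)
print(cov)
```

In the Spark code, the analogous fix would be `XY.map(lambda xy: (xy[0] - Meanx) * (xy[1] - Meany)).sum()`.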

Upvotes: 1
