user3763116

Reputation: 235

Apache Spark - calculate correlation

I am trying to calculate the correlation between user ratings. I came up with a simple program and am now trying to understand the result of the Pearson correlation.

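import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// One vector per user, one element per rating
val user1 = Vectors.dense(10, 2, 3, 3)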
val user2 = Vectors.dense(10, 3, 2, 2)
val user3 = Vectors.dense(1, 8, 9, 1)
val user4 = Vectors.dense(3, 9, 8, 2)
val user5 = Vectors.dense(1, 1, 1, 1)
val user6 = Vectors.dense(2, 2, 2, 2)


val users = spark.sparkContext.parallelize(Array(user1, user2, user3, user4, user5, user6))

val corr = Statistics.corr(users)

And this is the matrix result for reference:

1.0                   -0.30336465877348895  -0.33033040622002124  0.7679896586280794    
-0.30336465877348895  1.0                   0.9660056657223798    -0.21945076948288175  
-0.33033040622002124  0.9660056657223798    1.0                   -0.21945076948288175  
0.7679896586280794    -0.21945076948288175  -0.21945076948288175  1.0     

Could someone help me interpret this matrix? I was surprised that it has 4 rows and 4 columns, since I have six users as input.

Upvotes: 0

Views: 954

Answers (1)

zero323

Reputation: 330303

There is not much to explain here. As you can read in the API docs, corr(X: RDD[Vector]) returns:

Pearson correlation matrix comparing columns in X.

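Each input vector is one row of X, so six vectors of four elements each form a 6 x 4 matrix. Comparing its four columns gives a 4 x 4 correlation matrix: entry (i, j) is the correlation between rating column i and rating column j across the six users, not between user i and user j.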
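To make the row/column distinction concrete, here is a minimal sketch in plain Scala (no Spark) that recomputes the correlations from the question's data. The names `CorrelationSketch`, `pearson`, `corrMatrix`, `byItem` and `byUser` are made up for illustration and are not part of the Spark API; the formula is the textbook Pearson coefficient, which I am assuming matches what Statistics.corr computes by default.

```scala
object CorrelationSketch {
  // Ratings from the question: one row per user, one column per rated item.
  val ratings: Seq[Seq[Double]] = Seq(
    Seq(10.0, 2.0, 3.0, 3.0),
    Seq(10.0, 3.0, 2.0, 2.0),
    Seq(1.0, 8.0, 9.0, 1.0),
    Seq(3.0, 9.0, 8.0, 2.0),
    Seq(1.0, 1.0, 1.0, 1.0),
    Seq(2.0, 2.0, 2.0, 2.0)
  )

  // Textbook Pearson correlation of two equally long sequences.
  // Evaluates to NaN when either sequence has zero variance (0.0 / 0.0).
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    val n = x.length
    val meanX = x.sum / n
    val meanY = y.sum / n
    val cov = x.zip(y).map { case (a, b) => (a - meanX) * (b - meanY) }.sum
    val varX = x.map(a => (a - meanX) * (a - meanX)).sum
    val varY = y.map(b => (b - meanY) * (b - meanY)).sum
    cov / math.sqrt(varX * varY)
  }

  // Pairwise correlations between the given vectors.
  def corrMatrix(vectors: Seq[Seq[Double]]): Seq[Seq[Double]] =
    vectors.map(a => vectors.map(b => pearson(a, b)))

  // Column-wise view (four rating columns): 4 x 4, as in the question's output.
  val byItem: Seq[Seq[Double]] = corrMatrix(ratings.transpose)

  // Row-wise view (six users): 6 x 6, which is what the question expected.
  val byUser: Seq[Seq[Double]] = corrMatrix(ratings)
}
```

With Spark itself, the equivalent of `byUser` would be to transpose the data first, so that each user becomes a column (four vectors of six elements), and then call Statistics.corr on that. Note that user5 and user6 gave the same rating to every item, so their variance is zero and their correlation with any other user is undefined; in this sketch those entries come out as NaN.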

Upvotes: 1
