Reputation: 1944
I want to create a data frame with three columns:"variable 1","variable 2", "correlations" from a given dataframe of 200 variables
Now for any two columns in a dataframe I am using the following code to calculate correlations
import sqlContext.implicits._
import org.apache.spark.mllib.stat.Statistics
// Generate some random data
scala.util.Random.setSeed(1)
val df = sc.parallelize(g.sample(1000).zip(g.sample(1000))).toDF("x", "y")
// Select columns and extract values
val rddX = df.select($"x").rdd.map(_.getDouble(0))
val rddY = df.select($"y").rdd.map(_.getDouble(0))
val correlation: Double = Statistics.corr(rddX, rddY, "spearman")
How I can I do the same for a set of x variables in the dataframe so as to find out the variables with highest correlations from the resulting dataframe
Upvotes: 0
Views: 1728
Reputation: 24198
You should first convert your RDD[Row]
to an RDD[Vector]
, and then you can simply use Statistics.corr()
with your rdd
as the input argument to generate a correlation matrix:
import org.apache.spark.mllib.linalg.Vectors
val rdd_vec = df.rdd.map(row => {
Vectors.dense(row.toSeq.toArray.map({
case d: Double => d
}))
})
val correlMatrix = Statistics.corr(rdd_vec, "spearman")
Upvotes: 2