Iterate across columns in Spark (Scala)

Question

I want to create a DataFrame with three columns, "variable 1", "variable 2", and "correlations", from a given DataFrame of 200 variables.

For any two columns in a DataFrame, I currently use the following code to calculate the correlation:

import sqlContext.implicits._
import org.apache.spark.mllib.stat.Statistics

// Generate some random data
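scala.util.Random.setSeed(1)
// `g` was not defined in the original snippet; standard-normal draws are assumed here
val df = sc.parallelize(Seq.fill(1000)((scala.util.Random.nextGaussian(), scala.util.Random.nextGaussian()))).toDF("x", "y")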


// Select columns and extract values
val rddX = df.select($"x").rdd.map(_.getDouble(0))
val rddY = df.select($"y").rdd.map(_.getDouble(0))

val correlation: Double = Statistics.corr(rddX, rddY, "spearman")

How can I do the same for a set of variables in the DataFrame, so that I can find the variables with the highest correlations from the resulting DataFrame?

Answers (1)
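
One possible approach, sketched under assumptions rather than offered as a definitive answer: instead of looping over every pair of columns, pass all columns at once to the RDD[Vector] overload of Statistics.corr, which returns the full correlation matrix, and then flatten the upper triangle of that matrix into rows. The object name PairwiseCorrelations and the helper pairwise are hypothetical names introduced for illustration. The sketch assumes every selected column is of DoubleType and contains no nulls.

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of Spark.
object PairwiseCorrelations {

  // Returns one row per unordered pair of columns:
  // (variable 1, variable 2, correlations).
  // Assumption: all listed columns are DoubleType with no nulls.
  def pairwise(df: DataFrame,
               columns: Seq[String],
               method: String = "spearman"): DataFrame = {
    val sqlContext = df.sqlContext
    import sqlContext.implicits._

    // Build one dense vector per row from the selected columns, in order.
    val vectors = df.select(columns.map(col): _*).rdd.map { row =>
      Vectors.dense(Array.tabulate(row.length)(i => row.getDouble(i)))
    }

    // Local (columns.length x columns.length) correlation matrix.
    val matrix: Matrix = Statistics.corr(vectors, method)

    // Keep only the upper triangle: each pair once, no self-correlations.
    val pairs = for {
      i <- columns.indices
      j <- (i + 1) until columns.length
    } yield (columns(i), columns(j), matrix(i, j))

    pairs.toDF("variable 1", "variable 2", "correlations")
  }
}
```

With the DataFrame from the question, PairwiseCorrelations.pairwise(df, df.columns) would yield a single row for the pair ("x", "y"). To surface the strongest relationships, the result can be sorted, for example with orderBy(abs(col("correlations")).desc), where abs comes from org.apache.spark.sql.functions. For 200 variables the matrix holds 200 x 200 entries and fits comfortably on the driver, giving 19,900 pairs.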