Rajarshi Bhadra

Reputation: 1944

Iterate across columns in Spark Scala

I want to create a dataframe with three columns, "variable 1", "variable 2", and "correlation", from a given dataframe of 200 variables.

Now, for any two columns in a dataframe, I am using the following code to calculate the correlation:

import sqlContext.implicits._
import org.apache.spark.mllib.stat.Statistics
import breeze.stats.distributions.Gaussian

// Generate some random data
// (g was left undefined in the original snippet; a Breeze standard-normal
// sampler is assumed here)
val g = new Gaussian(0.0, 1.0)
val df = sc.parallelize(g.sample(1000).zip(g.sample(1000))).toDF("x", "y")


// Select columns and extract values
val rddX = df.select($"x").rdd.map(_.getDouble(0))
val rddY = df.select($"y").rdd.map(_.getDouble(0))

val correlation: Double = Statistics.corr(rddX, rddY, "spearman")

How can I do the same for a set of x variables in the dataframe, so as to find the variables with the highest correlations in the resulting dataframe?

Upvotes: 0

Views: 1728

Answers (1)

mtoto

Reputation: 24198

You should first convert your RDD[Row] to an RDD[Vector], and then you can simply use Statistics.corr() with your RDD as the input argument to generate a correlation matrix:

import org.apache.spark.mllib.linalg.Vectors

val rdd_vec = df.rdd.map { row =>
  // Assumes every column holds a Double; any other column type
  // would throw a MatchError here
  Vectors.dense(row.toSeq.toArray.map {
    case d: Double => d
  })
}

val correlMatrix = Statistics.corr(rdd_vec, "spearman")
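If you then need the three-column layout from the question ("variable 1", "variable 2", "correlation") and the most correlated pairs, one possible follow-up (a sketch, not part of the original answer; it reuses the sqlContext.implicits._ import from the question and assumes correlMatrix was built from df's columns in their original order) is to flatten the upper triangle of the matrix back into a DataFrame:

import org.apache.spark.sql.functions.abs

// The correlation matrix is symmetric with a unit diagonal,
// so only the upper triangle (i < j) carries information
val cols = df.columns
val pairs = for {
  i <- 0 until correlMatrix.numRows
  j <- (i + 1) until correlMatrix.numCols
} yield (cols(i), cols(j), correlMatrix(i, j))

val correlDF = sc.parallelize(pairs).toDF("variable 1", "variable 2", "correlation")

// Strongest relationships (positive or negative) first
correlDF.orderBy(abs($"correlation").desc).show()

For 200 variables this yields 200 * 199 / 2 = 19,900 pairs, which is small enough to sort or collect on the driver.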

Upvotes: 2
