Maher HTB

Reputation: 737

Add header to correlation matrix in Spark

I am computing correlations on a CSV file using Apache Spark. When loading the data, I have to skip the first row, which is the header containing the column names; otherwise I can't load the data.

The correlation is computed, but I can't add the column names as a header to the resulting correlation matrix. How can I get the matrix with its header? This is what I have tried:

import org.apache.spark.mllib.linalg.{ Vector, Vectors }
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.rdd.RDD

val data = sc.textFile(strfilePath).mapPartitionsWithIndex {
  case (index, iterator) => if (index == 0) iterator.drop(1) else iterator
}

val inputMatrix = data.map { line =>
  val values = line.split(",").map(_.toDouble)
  Vectors.dense(values)
}

val correlationMatrix = Statistics.corr(inputMatrix, "pearson")

Upvotes: 1

Views: 734

Answers (1)

Shaido

Reputation: 28322

In Spark 2.0+ you can load a CSV file, header included, into a DataFrame with:

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("filePath")

The correlation between any two columns can then be computed by name with:

df.stat.corr("col1", "col2", "pearson")
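Since `df.stat.corr` handles one pair of columns at a time, getting the whole matrix with its header means looping over `df.columns`. The following is only a minimal sketch, not a definitive implementation: it assumes every column in the file is numeric, that the file path is passed as the first program argument, and that a local master is acceptable. The object name `LabeledCorrelation` and the helper `render` are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.SparkSession

object LabeledCorrelation {

  // Hypothetical helper: renders a square matrix as comma-separated text,
  // with the column names as the header row and as the first cell of each row.
  def render(names: Seq[String], cell: (Int, Int) => Double): String = {
    val header = ("" +: names).mkString(",")
    val rows = names.indices.map { i =>
      (names(i) +: names.indices.map(j => cell(i, j).toString)).mkString(",")
    }
    (header +: rows).mkString("\n")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: running locally; adjust the master for a real cluster.
    val spark = SparkSession.builder()
      .appName("LabeledCorrelation")
      .master("local[*]")
      .getOrCreate()

    // Assumption: args(0) is the CSV path and all columns are numeric.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(args(0))

    val names = df.columns.toSeq
    val n = names.length
    val corr = Array.ofDim[Double](n, n)

    // The matrix is symmetric, so compute the upper triangle and mirror it.
    // Each call to df.stat.corr runs its own Spark job.
    for (i <- 0 until n; j <- i until n) {
      val c = df.stat.corr(names(i), names(j), "pearson")
      corr(i)(j) = c
      corr(j)(i) = c
    }

    println(render(names, (i, j) => corr(i)(j)))
    spark.stop()
  }
}
```

Because this runs one job per column pair, it is probably only reasonable for a modest number of columns. For wide datasets, it may be better to keep `Statistics.corr` from the question, read the column names from the first line of the file, and pass the resulting matrix to the same `render` helper as `(i, j) => correlationMatrix(i, j)`.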

Upvotes: 1
