Dale Angus

Reputation: 43

Spark ML Library

I am testing this Scala code that I found in the Machine Learning Library (MLlib) Guide:

import org.apache.spark.ml.linalg.{Matrix, Vectors, Vector}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import scala.collection.Seq

object BasicStatistics {
  def main(args: Array[String]): Unit = {

    val data: Seq[Vector] = Seq(
      Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
      Vectors.dense(4.0, 5.0, 0.0, 3.0),
      Vectors.dense(6.0, 7.0, 0.0, 8.0),
      Vectors.sparse(4, Seq((0, 9.0), (3, 1.0))))

    val df = data.map(Tuple1.apply).toDF("features")
    val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
    println(s"Pearson correlation matrix:\n $coeff1")

    val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
    println(s"Spearman correlation matrix:\n $coeff2")

  }
}

But this line is reporting an error:

val df = data.map(Tuple1.apply).toDF("features")

It says, "value toDF is not a member of Seq[(org.apache.spark.ml.linalg.Vector,)]"

It seems like the result of data.map(Tuple1.apply) does not have a toDF method?

Any ideas on how to proceed?

Below is from my pom.xml:

<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>

</dependencies>

Upvotes: 0

Views: 220

Answers (2)

Haroun Mohammedi

Reputation: 2424

This is because the implicit conversion that adds toDF to a scala.Seq is missing.

To fix your problem, add these lines:

import org.apache.spark.sql.SparkSession

val name = "application name"
val spark = SparkSession
  .builder
  .appName(name)
  .master("local")
  .getOrCreate()

// toDF comes from these implicits; the import must come after the session exists
import spark.implicits._
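
For context, here is how those lines slot into the original program. This is a minimal sketch, assuming local execution; "application name" is just a placeholder:

import org.apache.spark.ml.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{Row, SparkSession}

object BasicStatistics {
  def main(args: Array[String]): Unit = {
    // Create the session before anything that needs a DataFrame
    val spark = SparkSession
      .builder
      .appName("application name")
      .master("local")
      .getOrCreate()

    // toDF on a local Seq is provided by these implicits
    import spark.implicits._

    val data: Seq[Vector] = Seq(
      Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
      Vectors.dense(4.0, 5.0, 0.0, 3.0),
      Vectors.dense(6.0, 7.0, 0.0, 8.0),
      Vectors.sparse(4, Seq((0, 9.0), (3, 1.0))))

    val df = data.map(Tuple1.apply).toDF("features")
    val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
    println(s"Pearson correlation matrix:\n $coeff1")

    spark.stop()
  }
}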

Hope it helps!

Upvotes: 1

hoyland

Reputation: 1824

At this point, you don't have a SparkSession or anything else started. I believe toDF comes from importing spark.implicits._, where spark is a SparkSession. The documentation sometimes does not make this clear and/or assumes you're working in the Spark shell, which creates the session automatically.

Your code does run in the Spark shell.
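
To make the difference concrete: spark-shell pre-creates a SparkSession named spark, and its startup imports include spark.implicits._, so the conversion is already in scope. A quick sketch of what you can paste into the shell:

// No session setup or implicits import needed inside spark-shell
import org.apache.spark.ml.linalg.{Vector, Vectors}

val data: Seq[Vector] = Seq(Vectors.dense(4.0, 5.0, 0.0, 3.0))
val df = data.map(Tuple1.apply).toDF("features")  // compiles, since the implicits are pre-imported
df.show()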

Upvotes: 0
