Merve Bozo

Reputation: 439

Apache Spark MLlib LabeledPoint null label issue

I'm trying to run one of the MLlib algorithms, namely LogisticRegressionWithLBFGS, on my database.

This algorithm takes the training set as LabeledPoint. Since LabeledPoint requires a double label (LabeledPoint(label: Double, features: Vector)) and my database contains some null values, how can I solve this problem?

Here you can see the piece of code related to this issue :

val labeled = table.map{ row => 
    var s = row.toSeq.toArray           
    s = s.map(el => if (el != null) el.toString.toDouble)
    LabeledPoint(row(0), Vectors.dense((s.take(0) ++ s.drop(1))))
    }

And the error that I get:

error   : type mismatch;
found   : Any
required: Double

Can I run this algorithm without using LabeledPoint, or how else can I overcome this "null value" issue?

Upvotes: 1

Views: 1026

Answers (1)

zero323

Reputation: 330173

Some reasons why this code cannot work:

  • Row.toSeq returns Seq[Any], so s is Seq[Any] as well
  • since you cover only the non-null case, el => if (el != null) el.toString.toDouble is of type Any => AnyVal; if el is null it returns Unit
  • even if it weren't, you assign the result back to a var of type Seq[Any], so that is exactly what you get. Either way, it is not valid input for Vectors.dense
  • Row.apply is of type Int => Any, so its output cannot be used as a label
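
The second point is visible in plain Scala, independent of Spark: a one-armed if has an implicit else branch of type Unit, so the expression's type widens and null inputs silently become (). A minimal sketch, with an explicit default as one possible fix:

```scala
object OneArmedIf {
  def main(args: Array[String]): Unit = {
    val values: Array[Any] = Array("1.5", null, "2.0")

    // One-armed if: the branch types are Double and Unit,
    // so the result type widens to AnyVal and nulls become ().
    val widened: Array[AnyVal] =
      values.map(el => if (el != null) el.toString.toDouble)
    println(widened.mkString(", "))  // 1.5, (), 2.0

    // A total alternative: supply an explicit default for nulls.
    val doubles: Array[Double] =
      values.map(el => if (el != null) el.toString.toDouble else 0.0)
    println(doubles.mkString(", "))  // 1.5, 0.0, 2.0
  }
}
```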

Works, but has no effect:

  • s.take(0) always returns an empty collection, so s.take(0) ++ s.drop(1) is just s.drop(1)

May stop working in Spark 2.0:

  • map over a DataFrame - not much we can do about it for now, since the Vector class has no Encoder available.

How you can approach this:

  • either drop incomplete rows or fill in the missing values, for example using DataFrameNaFunctions:

      // You definitely want something smarter than that
      val fixed = df.na.fill(0.0)
      // or
      val filtered = df.na.drop
    
  • use VectorAssembler to build vectors:

    import org.apache.spark.ml.feature.VectorAssembler
    
    val assembler = new VectorAssembler()
      .setInputCols(df.columns.tail)
      .setOutputCol("features")
    
    val assembled = assembler.transform(fixed)
    
  • convert to LabeledPoint:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.sql.Row

    // Assuming the label column is called "label"
    assembled.select($"label", $"features").rdd.map {
      case Row(label: Double, features: Vector) =>
        LabeledPoint(label, features)
    }
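
Once you have an RDD[LabeledPoint], the algorithm from the question can be trained directly. A minimal sketch, assuming the Spark 1.x MLlib API, a binary label, and the assembled DataFrame built in the previous step:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Row

// RDD[LabeledPoint] built as above; cache since LBFGS is iterative
val training = assembled.select($"label", $"features").rdd.map {
  case Row(label: Double, features: Vector) =>
    LabeledPoint(label, features)
}.cache()

// Train a binary logistic regression model
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)
```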
    

Upvotes: 2
