Merve Bozo

Reputation: 439

Apache Spark MLlib LabeledPoint null label issue

I'm trying to run one of the MLlib algorithms, namely LogisticRegressionWithLBFGS, on my database.

This algorithm takes the training set as LabeledPoint. Since LabeledPoint requires a double label (LabeledPoint(label: Double, features: Vector)) and my database contains some null values, how can I solve this problem?

Here you can see the piece of code related to this issue :

val labeled = table.map{ row => 
    var s = row.toSeq.toArray           
    s = s.map(el => if (el != null) el.toString.toDouble)
    LabeledPoint(row(0), Vectors.dense((s.take(0) ++ s.drop(1))))
    }

And the error that I get:

error   : type mismatch;
found   : Any
required: Double

Can I run this algorithm without using LabeledPoint, or how else can I overcome this "null value" issue?

Upvotes: 1

Views: 1026

Answers (1)

zero323

Reputation: 330173

Some reasons why this code cannot work:

  • Row.toSeq returns Seq[Any], so s is Seq[Any] as well
  • since you cover only the non-null case, el => if (el != null) el.toString.toDouble is of type Any => AnyVal; if el is null it returns Unit
  • even if it weren't, you assign the result back to a var of type Seq[Any], so that is exactly what you get. Either way, it is not valid input for Vectors.dense
  • Row.apply is of type Int => Any, so its output cannot be used as a label
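
The second point is visible in plain Scala, independent of Spark: a one-armed if has an implicit else branch of type Unit, so the expression's type widens and null inputs silently become (). A minimal sketch, with an explicit default as one possible fix:

```scala
object OneArmedIf {
  def main(args: Array[String]): Unit = {
    val values: Array[Any] = Array("1.5", null, "2.0")

    // One-armed if: the branch types are Double and Unit,
    // so the result type widens to AnyVal and nulls become ().
    val widened: Array[AnyVal] =
      values.map(el => if (el != null) el.toString.toDouble)
    println(widened.mkString(", "))  // 1.5, (), 2.0

    // A total alternative: supply an explicit default for nulls.
    val doubles: Array[Double] =
      values.map(el => if (el != null) el.toString.toDouble else 0.0)
    println(doubles.mkString(", "))  // 1.5, 0.0, 2.0
  }
}
```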

Works, but has no effect:

  • s.take(0) always returns an empty collection, so s.take(0) ++ s.drop(1) is just s.drop(1)

May stop working in Spark 2.0:

  • map over a DataFrame - not much we can do about it for now, since the Vector class has no Encoder available.

How you can approach this:

  • either drop incomplete rows or fill in the missing values, for example using DataFrameNaFunctions:

      // You definitely want something smarter than that
      val fixed = df.na.fill(0.0)
      // or
      val filtered = df.na.drop
    
  • use VectorAssembler to build vectors:

    import org.apache.spark.ml.feature.VectorAssembler
    
    val assembler = new VectorAssembler()
      .setInputCols(df.columns.tail)
      .setOutputCol("features")
    
    val assembled = assembler.transform(fixed)
    
  • convert to LabeledPoint:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.sql.Row

    // Assuming the label column is called "label"
    assembled.select($"label", $"features").rdd.map {
      case Row(label: Double, features: Vector) =>
        LabeledPoint(label, features)
    }
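
Once you have an RDD[LabeledPoint], the algorithm from the question can be trained directly. A minimal sketch, assuming the Spark 1.x MLlib API, a binary label, and the assembled DataFrame built in the previous step:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Row

// RDD[LabeledPoint] built as above; cache since LBFGS is iterative
val training = assembled.select($"label", $"features").rdd.map {
  case Row(label: Double, features: Vector) =>
    LabeledPoint(label, features)
}.cache()

// Train a binary logistic regression model
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)
```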
    

Upvotes: 2
