Reputation: 439
I'm trying to run one of the MLlib algorithms, namely LogisticRegressionWithLBFGS, on my database.
This algorithm takes the training set as LabeledPoint. Since LabeledPoint requires a double label (LabeledPoint(double label, Vector features)) and my database contains some null values, how can I solve this problem?
Here is the piece of code related to this issue:
val labeled = table.map{ row =>
var s = row.toSeq.toArray
s = s.map(el => if (el != null) el.toString.toDouble)
LabeledPoint(row(0), Vectors.dense((s.take(0) ++ s.drop(1))))
}
And the error that I get:
error : type mismatch;
found : Any
required: Double
Can I run this algorithm without using LabeledPoint, or how else can I overcome this "null value" issue?
Upvotes: 1
Views: 1026
Reputation: 330173
Some reasons why this code cannot work:
- Row.toSeq is of type () => Seq[Any], so s (after toArray) is an Array[Any].
- el => if (el != null) el.toString.toDouble is of type T => AnyVal (where T is any type). If el is null it returns Unit.
- Even if it did not, you assign the result back to a var whose elements are of type Any, and that is exactly what you get. One way or another it is not a valid input for Vectors.dense.
- Row.apply is of type Int => Any, so its output cannot be used as a label.

Should work but has no effect:
- s.take(0)
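
The type problems above can be reproduced without Spark. The following is a minimal sketch in plain Scala; the sample values and the 0.0 placeholder are made up for illustration and are not part of the original code.

```scala
// Values as they might come out of Row.toSeq: typed Any, possibly null.
val raw: Array[Any] = Array(1.0, null, "3.5")

// `if` without `else`: a null element yields the unit value (),
// and in Scala 2 the static element type is AnyVal, not Double.
val broken = raw.map(el => if (el != null) el.toString.toDouble)

// Null-safe alternative: Option(null) is None, so missing values stay explicit.
val parsed: Array[Option[Double]] =
  raw.map(el => Option(el).map(_.toString.toDouble))

// Decide explicitly what a missing value means (0.0 is only a placeholder).
val filled: Array[Double] = parsed.map(_.getOrElse(0.0))

// take(0) is always empty, so this is the same as filled.drop(1).
val features: Array[Double] = filled.take(0) ++ filled.drop(1)
```

Keeping the intermediate Option[Double] makes it possible to skip incomplete rows instead of filling them, which mirrors the drop versus fill choice on the DataFrame side.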

May stop working in Spark 2.0:
- map over a DataFrame. There is not much we can do about it now since the Vector class has no encoder available.

How you can approach this:
- Either keep only complete rows or fill in missing values, for example using DataFrameNaFunctions:
// You definitely want something smarter than that
val fixed = df.na.fill(0.0)
// or
val filtered = df.na.drop
- Use VectorAssembler to build vectors:
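import org.apache.spark.ml.feature.VectorAssembler
// Assumes the label is the first column, as in the question (row(0)),
// so every remaining column is treated as a feature.
val assembler = new VectorAssembler()
  .setInputCols(df.columns.tail)
  .setOutputCol("features")
val assembled = assembler.transform(fixed)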
- Convert to LabeledPoint:
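import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Row

// Assuming the label column is called label and is of type Double.
// In Spark 1.x VectorAssembler emits mllib vectors, which is what this pattern matches.
assembled.select($"label", $"features").rdd.map {
  case Row(label: Double, features: Vector) =>
    LabeledPoint(label, features)
}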
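
As a rough illustration of "something smarter" for the first step, DataFrameNaFunctions can drop rows only when specific columns are null and fill each column with its own default. This is a sketch under assumptions: it uses SparkSession (Spark 2.0 or later) in local mode, and the column names x1 and x2, the toy rows, and the fill values are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Local session just for this sketch.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("na-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical toy data: Option[Double] fields become nullable double columns.
val rows: Seq[(Option[Double], Option[Double], Option[Double])] = Seq(
  (Some(1.0), Some(0.5), None),
  (None, Some(1.5), Some(2.0)),
  (Some(0.0), None, Some(4.0))
)
val df = rows.toDF("label", "x1", "x2")

// A row without a label cannot become a LabeledPoint, so drop only those rows.
val labeledOnly = df.na.drop(Seq("label"))

// Fill each feature column with its own default (placeholder values).
val fixed = labeledOnly.na.fill(Map("x1" -> 0.0, "x2" -> -1.0))
```

Whether a constant, a column mean, or dropping the row is appropriate depends on the data; the values here only show the mechanics.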
Upvotes: 2