Reputation: 720
I am facing a problem when I try to assemble a vector from a DataFrame (some columns contain null values) in Scala. Unfortunately, VectorAssembler cannot handle null values.
What I could do is replace or fill the DataFrame's null values and then create a dense vector, but that is not what I want.
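For illustration, the fill-then-assemble workaround I want to avoid looks roughly like this (a sketch; df, hour and mobile are placeholder names):
import org.apache.spark.ml.feature.VectorAssembler

// Replace nulls with a sentinel value (0.0) in the affected columns...
val filled = df.na.fill(0.0, Seq("hour", "mobile"))

// ...then assemble them into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("hour", "mobile"))
  .setOutputCol("features")

val assembled = assembler.transform(filled)
The problem is that the sentinel 0.0 is then indistinguishable from a real measurement of 0.0.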
So I thought about converting my DataFrame rows to sparse vectors instead. But how can I achieve this? I have not found an option on VectorAssembler for producing a sparse vector.
EDIT: I do not actually need null inside the sparse vector, but the missing entry should not become a value like 0 (or anything else), as it would in a dense vector.
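To make the desired representation concrete, a minimal sketch (values made up): a sparse vector only stores entries for the positions that are present, so a null column would simply have no entry:
import org.apache.spark.ml.linalg.Vectors

// A size-3 vector where position 1 (the null column) has no active entry.
val sv = Vectors.sparse(3, Array(0, 2), Array(1.0, 7.0))
println(sv)  // prints: (3,[0,2],[1.0,7.0])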
Do you have any suggestions?
Upvotes: 1
Views: 1604
Reputation: 5210
You could do it manually like this:
import org.apache.spark.SparkException
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

import scala.collection.mutable.ArrayBuilder

// Sample data; avoid naming the case class Row, which clashes with
// org.apache.spark.sql.Row.
case class Record(a: Double, b: Option[Double], c: Double, d: Vector, e: Double)

// Assumes an existing SparkSession named spark (e.g. in spark-shell).
val dataset = spark.createDataFrame(
  Seq(Record(0, None, 3.0, Vectors.dense(4.0, 5.0, 0.5), 7.0),
      Record(1, Some(2.0), 3.0, Vectors.dense(4.0, 5.0, 0.5), 7.0))
).toDF("id", "hour", "mobile", "userFeatures", "clicked")

val sparseVectorRDD = dataset.rdd.map { row =>
  val indices = ArrayBuilder.make[Int]
  val values = ArrayBuilder.make[Double]
  var cur = 0

  row.toSeq.foreach {
    case v: Double =>
      // Plain numeric column: record it at the current position.
      indices += cur
      values += v
      cur += 1
    case vec: Vector =>
      // Vector column: copy its active entries, shifted by the current offset.
      vec.foreachActive { case (i, v) =>
        indices += cur + i
        values += v
      }
      cur += vec.size
    case null =>
      // Null column: advance the position without adding an active entry.
      cur += 1
    case o =>
      throw new SparkException(s"$o of type ${o.getClass.getName} is not supported.")
  }

  Vectors.sparse(cur, indices.result(), values.result())
}
You can then convert the result back to a DataFrame if needed. Since Row values are not type-checked, you have to handle the matching manually and cast to the appropriate type where needed.
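A minimal sketch of that conversion, assuming a single features column (the printed output is illustrative):
// Wrap each vector in a Tuple1 so Spark can infer a schema for it
// (ml.linalg.Vector has a registered UDT).
val sparseVectorDF = spark
  .createDataFrame(sparseVectorRDD.map(Tuple1.apply))
  .toDF("features")

sparseVectorDF.show(truncate = false)
// The row with the null "hour" should come out as something like:
// (7,[0,2,3,4,5,6],[0.0,3.0,4.0,5.0,0.5,7.0])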
Upvotes: 1