Reputation: 720
I am facing a problem when I try to assemble a vector from a DataFrame (some columns contain null values) in Scala. Unfortunately, VectorAssembler cannot handle null values.
What I could do is replace or fill the DataFrame's null values and then create a dense vector, but that is not what I want.
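For illustration, the fill-then-assemble workaround I want to avoid looks roughly like this (a sketch; df, hour and mobile are placeholder names):
import org.apache.spark.ml.feature.VectorAssembler

// Replace nulls with a sentinel value (0.0) in the affected columns...
val filled = df.na.fill(0.0, Seq("hour", "mobile"))

// ...then assemble them into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("hour", "mobile"))
  .setOutputCol("features")

val assembled = assembler.transform(filled)
The problem is that the sentinel 0.0 is then indistinguishable from a real measurement of 0.0.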
So I thought about converting my DataFrame rows to sparse vectors instead. But how can I achieve this? I have not found an option on VectorAssembler for producing a sparse vector.
EDIT: I do not actually need null inside the sparse vector, but the missing entry should not become a value like 0 (or anything else), as it would in a dense vector.
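To make the desired representation concrete, a minimal sketch (values made up): a sparse vector only stores entries for the positions that are present, so a null column would simply have no entry:
import org.apache.spark.ml.linalg.Vectors

// A size-3 vector where position 1 (the null column) has no active entry.
val sv = Vectors.sparse(3, Array(0, 2), Array(1.0, 7.0))
println(sv)  // prints: (3,[0,2],[1.0,7.0])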
Do you have any suggestions?
Upvotes: 1
Views: 1604
Reputation: 5210
You could do it manually like this:
import org.apache.spark.SparkException
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

import scala.collection.mutable.ArrayBuilder

// Sample data; avoid naming the case class Row, which clashes with
// org.apache.spark.sql.Row.
case class Record(a: Double, b: Option[Double], c: Double, d: Vector, e: Double)

// Assumes an existing SparkSession named spark (e.g. in spark-shell).
val dataset = spark.createDataFrame(
  Seq(Record(0, None, 3.0, Vectors.dense(4.0, 5.0, 0.5), 7.0),
      Record(1, Some(2.0), 3.0, Vectors.dense(4.0, 5.0, 0.5), 7.0))
).toDF("id", "hour", "mobile", "userFeatures", "clicked")

val sparseVectorRDD = dataset.rdd.map { row =>
  val indices = ArrayBuilder.make[Int]
  val values = ArrayBuilder.make[Double]
  var cur = 0

  row.toSeq.foreach {
    case v: Double =>
      // Plain numeric column: record it at the current position.
      indices += cur
      values += v
      cur += 1
    case vec: Vector =>
      // Vector column: copy its active entries, shifted by the current offset.
      vec.foreachActive { case (i, v) =>
        indices += cur + i
        values += v
      }
      cur += vec.size
    case null =>
      // Null column: advance the position without adding an active entry.
      cur += 1
    case o =>
      throw new SparkException(s"$o of type ${o.getClass.getName} is not supported.")
  }

  Vectors.sparse(cur, indices.result(), values.result())
}
You can then convert the result back to a DataFrame if needed. Since Row values are not type-checked, you have to handle the matching manually and cast to the appropriate type where needed.
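A minimal sketch of that conversion, assuming a single features column (the printed output is illustrative):
// Wrap each vector in a Tuple1 so Spark can infer a schema for it
// (ml.linalg.Vector has a registered UDT).
val sparseVectorDF = spark
  .createDataFrame(sparseVectorRDD.map(Tuple1.apply))
  .toDF("features")

sparseVectorDF.show(truncate = false)
// The row with the null "hour" should come out as something like:
// (7,[0,2,3,4,5,6],[0.0,3.0,4.0,5.0,0.5,7.0])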
Upvotes: 1