user4054919

Reputation: 129

Spark fill DataFrame with Vector for null

I have a DataFrame containing feature vectors created by VectorAssembler; it also contains null values. I now want to replace the null values with a vector:

 import org.apache.spark.ml.linalg.Vectors

 val nil = Vectors.dense(Array.fill(20)(1.0)) // a vector of twenty 1.0s

df.na.fill(nil) // does not compile: na.fill has no overload for Vector, only primitive types

What is the right way to do this?

EDIT: I found a way thanks to the answer:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{broadcast, coalesce}
import spark.implicits._

val nil = Vectors.dense(Array.fill(20)(1.0))

val fill = Seq(Tuple1(nil)).toDF("replacement")

val dates = data.schema.fieldNames.filter(e => e.contains("1"))

data = data.crossJoin(broadcast(fill))
for (e <- dates) {
  data = data.withColumn(e, coalesce(data.col(e), $"replacement"))
}
data = data.drop("replacement")

Upvotes: 1

Views: 1146

Answers (1)

Alper t. Turker

Reputation: 35229

If the nulls were introduced by adding some additional rows, you can fix them by cross-joining with a single-row replacement DataFrame and coalescing:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions._
import spark.implicits._

val nil = Vectors.dense(Array.fill(20)(1.0))

val df = Seq((1, None), (2, Some(nil))).toDF("id", "vector")
val fill = Seq(Tuple1(nil)).toDF("replacement")

df.crossJoin(broadcast(fill))
  .withColumn("vector", coalesce($"vector", $"replacement"))
  .drop("replacement")
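As a sketch of an alternative not from the original answer: a null-safe UDF can substitute the default vector directly, avoiding the cross join. The vector `nil` and the column name `vector` are assumptions carried over from the question:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Assumed default vector, as in the question.
val nil = Vectors.dense(Array.fill(20)(1.0))

// Returns the default vector whenever the input value is null.
val fillVector = udf((v: Vector) => Option(v).getOrElse(nil))

val fixed = df.withColumn("vector", fillVector(col("vector")))
```

This keeps the plan free of a join, at the cost of routing every row through a JVM UDF call; for many columns, the coalesce-based approach above loops more naturally over column names.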

Upvotes: 2
