Reputation: 33
I have the following dataframe schema:
root
|-- firstname: string (nullable = true)
|-- lastname: string (nullable = true)
|-- cities: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- postcode: string (nullable = true
And my dataframe looks like this:
+---------+--------+-----------------------------------+
|firstname|lastname|cities |
+---------+--------+-----------------------------------+
|John |Doe |[[New York,A000000], [Warsaw,null]]|
|John |Smith |[[Berlin,null]] |
|John |null |[[Paris,null]] |
+---------+--------+-----------------------------------+
I want to replace all of null values with string "unknown". When I use na.fill function I get the following dataframe:
df.na.fill("unknown").show()
+---------+--------+-----------------------------------+
|firstname|lastname|cities |
+---------+--------+-----------------------------------+
|John |Doe |[[New York,A000000], [Warsaw,null]]|
|John |Smith |[[Berlin,null]] |
|John |unknown |[[Paris,null]] |
+---------+--------+-----------------------------------+
How can I replace ALL of the null values in dataframe (Including nested arrays)?
Upvotes: 1
Views: 2500
Reputation: 22449
na.fill
doesn't fill null elements within a struct field in an array column. One approach would be to use a UDF as shown below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
case class City(name: String, postcode: String)
val df = Seq(
("John", "Doe", Seq(City("New York", "A000000"), City("Warsaw", null))),
("John", "Smith", Seq(City("Berlin", null))),
("John", null, Seq(City("Paris", null)))
).toDF("firstname", "lastname", "cities")
val defaultStr = "unknown"
def patchNull(default: String) = udf( (s: Seq[Row]) =>
s.map( r => (r.getAs[String]("name"), r.getAs[String]("postcode")) match {
case (null, null) => (default, default)
case (c, null) => (c, default)
case (null, p) => (default, p)
case e => e
}
) )
df.
withColumn( "cities", patchNull(defaultStr)($"cities") ).
na.fill(defaultStr).
show(false)
// +---------+--------+--------------------------------------+
// |firstname|lastname|cities |
// +---------+--------+--------------------------------------+
// |John |Doe |[[New York,A000000], [Warsaw,unknown]]|
// |John |Smith |[[Berlin,unknown]] |
// |John |unknown |[[Paris,unknown]] |
// +---------+--------+--------------------------------------+
Upvotes: 3