Reputation: 33
I have a dataframe with "id" column and a column which has an array of struct. The schema:
root
|-- id: string (nullable = true)
|-- desc: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: long (nullable = false)
The array "desc" can have any number of null values. I would like to create the final dataframe with the array having none of the null values using Spark 1.6:
An example would be:
Key . Value
1010 . [[George,21],null,[MARIE,13],null]
1023 . [null,[Watson,11],[John,35],null,[Kyle,33]]
I want the final dataframe as:
id . desc
1010 . [[George,21],[MARIE,13]]
1023 . [[Watson,11],[John,35],[Kyle,33]]
I tried doing this with UDF and case class
but got
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to....
Any help is greatly appreciated and I would prefer doing it without converting to RDDs if needed.
Upvotes: 3
Views: 7550
Reputation: 24498
Spark 3.4+
array_compact($"desc")
Example input:
case class Person(name: String, age: Long)
val df1 = Seq(
("1010", Seq(Person("George", 21), null, Person("MARIE", 13), null)),
("1023", Seq(null, Person("Watson", 11), Person("John", 35), null, Person("Kyle", 33)))
).toDF("id", "desc")
df1.show(truncate=false)
// +----+--------------------------------------------------+
// |id |desc |
// +----+--------------------------------------------------+
// |1010|[{George, 21}, null, {MARIE, 13}, null] |
// |1023|[null, {Watson, 11}, {John, 35}, null, {Kyle, 33}]|
// +----+--------------------------------------------------+
df1.printSchema()
// root
// |-- id: string (nullable = true)
// |-- desc: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- name: string (nullable = true)
// | | |-- age: long (nullable = false)
Using array_compact
:
val df2 = df1.withColumn("desc", array_compact($"desc"))
df2.show(truncate=false)
// +----+--------------------------------------+
// |id |desc |
// +----+--------------------------------------+
// |1010|[{George, 21}, {MARIE, 13}] |
// |1023|[{Watson, 11}, {John, 35}, {Kyle, 33}]|
// +----+--------------------------------------+
Upvotes: 1
Reputation: 41987
Given that the original dataframe has following schema
root
|-- id: string (nullable = true)
|-- desc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: long (nullable = false)
Defining a udf
function to remove the null values from the array should work for you
import org.apache.spark.sql.functions._
def removeNull = udf((array: Seq[Row])=> array.filterNot(_ == null).map(x => element(x.getAs[String]("name"), x.getAs[Long]("age"))))
df.withColumn("desc", removeNull(col("desc")))
where element
is a case class
case class element(name: String, age: Long)
and you should get
+----+-----------------------------------+
|id |desc |
+----+-----------------------------------+
|1010|[[George,21], [MARIE,13]] |
|1010|[[Watson,11], [John,35], [Kyle,33]]|
+----+-----------------------------------+
Upvotes: 1
Reputation: 1528
Here is another version:
case class Person(name: String, age: Int)
root
|-- id: string (nullable = true)
|-- desc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: integer (nullable = false)
+----+-----------------------------------------------+
|id |desc |
+----+-----------------------------------------------+
|1010|[[George,21], null, [MARIE,13], null] |
|1023|[[Watson,11], null, [John,35], null, [Kyle,33]]|
+----+-----------------------------------------------+
val filterOutNull = udf((xs: Seq[Row]) => {
xs.flatMap {
case null => Nil
// convert the Row back to your specific struct:
case Row(s: String,i: Int) => List(Person(s, i))
}
})
val result = df.withColumn("filteredListDesc", filterOutNull($"desc"))
+----+-----------------------------------------------+-----------------------------------+
|id |desc |filteredListDesc |
+----+-----------------------------------------------+-----------------------------------+
|1010|[[George,21], null, [MARIE,13], null] |[[George,21], [MARIE,13]] |
|1023|[[Watson,11], null, [John,35], null, [Kyle,33]]|[[Watson,11], [John,35], [Kyle,33]]|
+----+-----------------------------------------------+-----------------------------------+
Upvotes: 2