Reputation: 87
I have this dataframe:
+----+--------------------------------+
|name|dates                           |
+----+--------------------------------+
|A   |[[1994, 12, 11], [,,]]          |
|B   |[[1994, 12, 11], [1994, 12, 15]]|
+----+--------------------------------+
with this schema:
root
 |-- name: string (nullable = true)
 |-- dates: struct (nullable = true)
 |    |-- start_date: struct (nullable = true)
 |    |    |-- year: integer (nullable = true)
 |    |    |-- month: integer (nullable = true)
 |    |    |-- day: integer (nullable = true)
 |    |-- end_date: struct (nullable = true)
 |    |    |-- year: integer (nullable = true)
 |    |    |-- month: integer (nullable = true)
 |    |    |-- day: integer (nullable = true)
I want the output below: when all fields inside end_date are null, end_date itself should be set to null.
+----+--------------------------------+
|name|dates                           |
+----+--------------------------------+
|A   |[[1994, 12, 11],]               |
|B   |[[1994, 12, 11], [1994, 12, 15]]|
+----+--------------------------------+
Upvotes: 2
Views: 1813
Reputation: 32660
You can update the struct column dates by building a new struct from its existing fields and using a when expression to check whether all end_date attributes are null:
import org.apache.spark.sql.functions.{col, lit, struct, when}

val df2 = df.withColumn(
  "dates",
  struct(
    col("dates.start_date"), // keep start_date as is
    when(
      // true only if year, month and day of end_date are all null
      Seq("year", "month", "day")
        .map(x => col(s"dates.end_date.$x").isNull)
        .reduce(_ and _),
      // typed null, so the column keeps its struct type
      lit(null).cast("struct<year:int,month:int,day:int>")
    ).otherwise(col("dates.end_date")).alias("end_date") // otherwise keep end_date
  )
)

df2.show(false)
//+----+--------------------------------+
//|name|dates                           |
//+----+--------------------------------+
//|A   |[[1994, 12, 11],]               |
//|B   |[[1994, 12, 11], [1994, 12, 15]]|
//+----+--------------------------------+
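
If the field names of end_date should not be hardcoded, the same check can be derived from the DataFrame schema. The following is only a sketch under a few assumptions: the case classes (Ymd, Dates, Person), the object name, and the helper nullIfAllFieldsNull are hypothetical names introduced here to make the example self-contained, and a local SparkSession stands in for whatever session you already have.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, struct, when}
import org.apache.spark.sql.types.StructType

// Hypothetical case classes mirroring the schema shown in the question.
case class Ymd(year: Option[Int], month: Option[Int], day: Option[Int])
case class Dates(start_date: Ymd, end_date: Ymd)
case class Person(name: String, dates: Dates)

object NullifyEmptyStruct {

  // Hypothetical helper: yields the struct at `path`, or a typed null
  // when every field of that struct is null. Field names and the cast
  // type are read from the schema instead of being written out by hand.
  def nullIfAllFieldsNull(df: DataFrame, path: String): Column = {
    val structType =
      df.select(col(path)).schema.head.dataType.asInstanceOf[StructType]
    val allNull = structType.fieldNames
      .map(f => col(s"$path.$f").isNull)
      .reduce(_ && _)
    when(allNull, lit(null).cast(structType)).otherwise(col(path))
  }

  // Rebuilds `dates`, keeping start_date and nulling an all-null end_date.
  def fixEndDate(df: DataFrame): DataFrame =
    df.withColumn(
      "dates",
      struct(
        col("dates.start_date").alias("start_date"),
        nullIfAllFieldsNull(df, "dates.end_date").alias("end_date")
      )
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, only for this example.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nullify-empty-struct")
      .getOrCreate()
    import spark.implicits._

    // Same two rows as in the question.
    val df = Seq(
      Person("A", Dates(Ymd(Some(1994), Some(12), Some(11)), Ymd(None, None, None))),
      Person("B", Dates(Ymd(Some(1994), Some(12), Some(11)), Ymd(Some(1994), Some(12), Some(15))))
    ).toDF()

    fixEndDate(df).show(false)
    spark.stop()
  }
}
```

The helper fails with a ClassCastException if the path does not point at a struct column, so it is only meant for paths like dates.end_date. On Spark 3.1 or later, Column.withField may be a shorter way to replace a single nested field without listing the others.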
Upvotes: 1