Reputation: 5541
I've got a Dataset<Row> with the following structure:
{"name": "Ben",
"lastHolidayDestination": "Florida",
"holidays": [
{"destination": "Florida",
"year": 2020},
{"destination": "Lille",
"year": 2019}
]}
I want to add a new column lastHolidayYear to the root of the Dataset using Spark SQL, populated by finding the holidays element whose destination matches lastHolidayDestination (assume there will only ever be one). So the output Dataset would be:
{"name": "Ben",
"lastHolidayDestination": "Florida",
"lastHolidayYear": 2020,
"holidays": [
{"destination": "Florida",
"year": 2020},
{"destination": "Lille",
"year": 2019}
]}
I've been playing around with dataset.withColumn() and when() (using Java, but Scala/Python answers are fine) but I've got nowhere so far. I really don't want to use a UDF unless I have to. Any suggestions?
Upvotes: 2
Views: 2261
Reputation: 5078
Since Spark 3.0, you can first filter the array and then get the first element of the array with the following expression:
import org.apache.spark.sql.functions.{element_at, filter, col}
val extractElementExpr = element_at(filter(col("myArrayColumnName"), myCondition), 1)
Where "myArrayColumnName" is the name of the column containing the array and myCondition is the condition, a Column => Column expression.
For your specific example, the code is:
import org.apache.spark.sql.functions.{col, element_at, filter}
import org.apache.spark.sql.Column
val isLastHoliday = (c: Column) => c.getField("destination") === col("lastHolidayDestination")
val getLastHoliday = element_at(filter(col("holidays"), isLastHoliday), 1)
val result = df.withColumn("lastHolidayYear", getLastHoliday.getField("year"))
With this code, if your input dataframe contains the following values:
+------+----------------------+--------------------------------+
|name |lastHolidayDestination|holidays |
+------+----------------------+--------------------------------+
|Ben |Florida |[[Florida, 2020], [Lille, 2019]]|
|Alice |Peru |[[Florida, 2020], [Lille, 2019]]|
|Robert|Lille |[[Florida, 2020], [Lille, 2019]]|
+------+----------------------+--------------------------------+
Output will be:
+------+----------------------+--------------------------------+---------------+
|name |lastHolidayDestination|holidays |lastHolidayYear|
+------+----------------------+--------------------------------+---------------+
|Ben |Florida |[[Florida, 2020], [Lille, 2019]]|2020 |
|Alice |Peru |[[Florida, 2020], [Lille, 2019]]|null |
|Robert|Lille |[[Florida, 2020], [Lille, 2019]]|2019 |
+------+----------------------+--------------------------------+---------------+
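For readers without a Spark shell handy, here is a plain-Python sketch (not Spark code; the dicts and helper name are illustrative) of what filter followed by element_at(…, 1).getField("year") computes per row, using the same three rows as the table above. When the filtered array is empty, the result is null, which is why Alice gets null:

```python
# Plain-Python analogue of Spark's filter + element_at(..., 1).getField("year"):
# keep the array elements matching the condition, then take the first (or None).
holidays = [{"destination": "Florida", "year": 2020},
            {"destination": "Lille", "year": 2019}]

rows = [
    {"name": "Ben", "lastHolidayDestination": "Florida"},
    {"name": "Alice", "lastHolidayDestination": "Peru"},
    {"name": "Robert", "lastHolidayDestination": "Lille"},
]

def last_holiday_year(row):
    # corresponds to filter(col("holidays"), isLastHoliday)
    matches = [h for h in holidays
               if h["destination"] == row["lastHolidayDestination"]]
    # corresponds to element_at(..., 1).getField("year"):
    # element_at on an out-of-range index yields null, modeled here as None
    return matches[0]["year"] if matches else None

for row in rows:
    row["lastHolidayYear"] = last_holiday_year(row)
```

Running this yields 2020 for Ben, None for Alice, and 2019 for Robert, matching the output table.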
Upvotes: 2
Reputation: 3619
To simulate the join with the array column you can use an explode and filter combo:
val result = ds.withColumn("expl", explode(col("holidays")))
.filter("lastHolidayDestination = expl.destination")
.withColumn("lastHolidayYear", col("expl.year"))
.drop("expl")
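One caveat worth noting (my observation, not stated in the answer): because the filter keeps only matching exploded rows, this behaves like an inner join, so rows whose lastHolidayDestination matches no holidays element are dropped entirely. A plain-Python sketch (not Spark code; names are illustrative) of the explode/filter/drop pipeline shows this:

```python
# Plain-Python analogue of explode + filter: each row is expanded to one row
# per holidays element, then non-matching pairs are filtered out.
holidays = [{"destination": "Florida", "year": 2020},
            {"destination": "Lille", "year": 2019}]

rows = [
    {"name": "Ben", "lastHolidayDestination": "Florida"},
    {"name": "Alice", "lastHolidayDestination": "Peru"},  # no matching holiday
]

# explode(col("holidays")): one output row per (row, holiday) pair
exploded = [dict(row, expl=h) for row in rows for h in holidays]

# filter("lastHolidayDestination = expl.destination")
kept = [r for r in exploded
        if r["lastHolidayDestination"] == r["expl"]["destination"]]

# withColumn("lastHolidayYear", col("expl.year")) then drop("expl")
result = [{**{k: v for k, v in r.items() if k != "expl"},
           "lastHolidayYear": r["expl"]["year"]} for r in kept]
# Alice's row is gone: only rows with a matching destination survive.
```

If you need to keep non-matching rows (with a null lastHolidayYear, as in the accepted answer's output), the filter/element_at approach above does that, while this explode-based version silently drops them.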
Upvotes: 2