Kurt Maile

Reputation: 1267

Spark SQL - Nested array conditional select

I have a Spark SQL question I'd appreciate some guidance on: what is the best way to do a conditional select from a nested array of structs?

I have an example JSON document below:

```
{
  "id": "p1",
  "externalIds": [
    {"system": "a", "id": "1"},
    {"system": "b", "id": "2"},
    {"system": "c", "id": "3"}
  ]
}
```
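For reference, a minimal sketch for loading this document into a DataFrame (assuming a SparkSession named `spark`; the file path is hypothetical):

```scala
// Read the pretty-printed JSON document into a DataFrame.
// Assumes a SparkSession named `spark` is already in scope.
val df = spark.read
  .option("multiLine", "true") // the document spans multiple lines
  .json("/path/to/people.json") // hypothetical path
```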

In Spark SQL I want to select the "id" of one of the array's structs based on some conditional logic.

E.g. for the above, select the id field of the array element whose "system" is "b", namely the id "2".

How best to do this in Spark SQL?

Cheers and thanks!

Upvotes: 2

Views: 1484

Answers (1)

Raphael Roth

Reputation: 27373

Using a UDF, this could look like the following, given a DataFrame (all attributes of type String):

```
+---+---------------------+
|id |externalIds          |
+---+---------------------+
|p1 |[[a,1], [b,2], [c,3]]|
+---+---------------------+
```

Define a UDF that traverses the array and finds the desired element:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Returns a UDF that scans the array of structs and yields the id of the
// first entry whose system matches (assumes field order (system, id)):
def getExternal(system: String) = {
  udf((rows: Seq[Row]) =>
    rows.map(r => (r.getString(0), r.getString(1)))
      .find { case (s, _) => s == system }
      .map { case (_, id) => id } // None => null in the result column
  )
}
```

and use it like this:

```scala
import spark.implicits._ // for the $"colName" column syntax

df
  .withColumn("external", getExternal("b")($"externalIds"))
  .show(false)
```

```
+---+---------------------+--------+
|id |externalIds          |external|
+---+---------------------+--------+
|p1 |[[a,1], [b,2], [c,3]]|2       |
+---+---------------------+--------+
```
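As an aside, on Spark 2.4 or later the same lookup can be expressed without a UDF, using the built-in higher-order function `filter`. A sketch, reusing the `df` from above:

```scala
import org.apache.spark.sql.functions.expr

// Spark 2.4+ sketch: `filter` keeps only the structs whose system matches;
// element_at(..., 1) picks the first match and yields null when nothing
// matched (under the default, non-ANSI settings).
df
  .withColumn(
    "external",
    expr("element_at(filter(externalIds, x -> x.system = 'b'), 1).id")
  )
  .show(false)
```

This keeps the whole expression inside Catalyst, so it avoids the serialization overhead a Scala UDF incurs.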

Upvotes: 2
