Lucas Sousa

Reputation: 411

Use higher-order function to add a calculated field to an array of structs in a Spark query

I have a table with a column of array of structs with this schema:

root
 |-- id: long (nullable = true)
 |-- mainColumn: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- fieldA: string (nullable = true)
 |    |    |-- fieldB: string (nullable = true)
 |    |    |-- fieldC: string (nullable = true)
 |    |    |-- fieldD: string (nullable = true)
 |-- randomOtherColumn: short (nullable = true)

I need to create a query that returns the entire table but applies a function (an encryption function) to fieldA of every entry. I have tried using the transform() function, but I also need to return the other fields (fieldB, fieldC, etc.).

SELECT
  x.id,
  x.randomOtherColumn,
  transform(x.mainColumn, y -> ???)
FROM
  my_table x

Something like this would be simple in JavaScript with the spread operator:

df.map(x => ({
  ...x,
  mainColumn: x.mainColumn.map(y => ({ ...y, fieldA: encrypt(y.fieldA) }))
}))

Upvotes: 2

Views: 411

Answers (1)

blackbishop

Reputation: 32660

You need to recreate the inner struct of each array element, referencing the fields through the lambda variable, like this:

SELECT
  id,
  randomOtherColumn,
  transform(
    mainColumn,
    x -> struct(encrypt(x.fieldA) AS fieldA, x.fieldB AS fieldB, x.fieldC AS fieldC, x.fieldD AS fieldD)
  ) AS mainColumn
FROM
  my_table
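
Note that encrypt is not a Spark built-in here; it stands in for whatever encryption routine you use. A minimal sketch, assuming a hypothetical Base64 placeholder in place of real encryption, that registers it as a UDF so the SQL above can resolve the call:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Hypothetical placeholder (Base64 is an encoding, not real encryption;
// swap in your actual routine). register() also returns a
// UserDefinedFunction, so the same `encrypt` value can be called
// from the DataFrame API below.
val encrypt = spark.udf.register("encrypt", (s: String) =>
  if (s == null) null else java.util.Base64.getEncoder.encodeToString(s.getBytes("UTF-8"))
)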

Or, using the DataFrame API with the Column method withField (Spark 3.1+):

import org.apache.spark.sql.functions.{col, transform}

// withField replaces fieldA and keeps the other struct fields as-is.
val df1 = df.withColumn(
  "mainColumn",
  transform(col("mainColumn"), x => x.withField("fieldA", encrypt(x("fieldA"))))
)
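
The advantage over the SQL version is that withField carries the remaining struct fields over automatically, so nothing has to be re-listed. A quick check on the df1 from above:

// The element struct should still contain fieldA..fieldD,
// with only fieldA's values transformed.
df1.printSchema()
df1.show(truncate = false)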

As you also tagged pyspark:

from pyspark.sql import functions as F

# `encrypt` must map a Column to a Column, e.g. a UDF created with F.udf
df1 = df.withColumn(
  "mainColumn",
  F.transform("mainColumn", lambda x: x.withField("fieldA", encrypt(x["fieldA"])))
)

Upvotes: 2
