I have the following schema:
>>> df.printSchema()
root
... SNIP ...
 |-- foo: array (nullable = true)
 |    |-- element: struct (containsNull = true)
... SNIP ...
 |    |    |-- value: double (nullable = true)
 |    |    |-- value2: double (nullable = true)
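For reference, a minimal sketch that reproduces just the foo column of this schema (the snipped fields are left out, and the single all-null element is assumed from the output shown below):

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DoubleType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Only the foo column from the (snipped) schema is rebuilt here.
schema = StructType([
    StructField("foo", ArrayType(StructType([
        StructField("value", DoubleType(), True),
        StructField("value2", DoubleType(), True),
    ]), containsNull=True), True),
])

# One row whose foo array holds a single struct with both fields null.
df = spark.createDataFrame([([(None, None)],)], schema)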
In this case, I only have one row in the dataframe and in the foo array:
>>> df.count()
1
>>> df.select(explode('foo').alias("fooColumn")).count()
1
value is null:
>>> df.select(explode('foo').alias("fooColumn")).select('fooColumn.value','fooColumn.value2').show()
+-----+------+
|value|value2|
+-----+------+
| null|  null|
+-----+------+
I want to edit value and make a new dataframe. I can explode foo and set value:
>>> fooUpdated = df.select(explode("foo").alias("fooColumn")).select("fooColumn.*").withColumn('value', lit(10)).select('value')
>>> fooUpdated.show()
+-----+
|value|
+-----+
|   10|
+-----+
How do I collapse this dataframe to put fooUpdated back in as an array of structs, or is there a way to do this without exploding foo?
In the end, I want to have the following:
>>> dfUpdated.select(explode('foo').alias("fooColumn")).select('fooColumn.value', 'fooColumn.value2').show()
+-----+------+
|value|value2|
+-----+------+
|   10|  null|
+-----+------+
You can use the transform higher-order function (available in Spark SQL since 2.4) to update each struct in the foo array. Here's an example:
import pyspark.sql.functions as F
df.printSchema()
#root
# |-- foo: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- value: string (nullable = true)
# |    |    |-- value2: long (nullable = true)
df1 = df.withColumn(
    "foo",
    F.expr("transform(foo, x -> struct(coalesce(x.value, 10) as value, x.value2 as value2))")
)
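If you are on Spark 3.1 or later, the same transform is also exposed in the Python API, so a sketch without the SQL expression string (assuming the same df) would be:

import pyspark.sql.functions as F

# Same idea via pyspark.sql.functions.transform (Spark 3.1+):
# rewrite each struct, replacing a null value with 10.
df1 = df.withColumn(
    "foo",
    F.transform(
        "foo",
        lambda x: F.struct(
            F.coalesce(x["value"], F.lit(10)).alias("value"),
            x["value2"].alias("value2"),
        ),
    ),
)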
Now, you can show the value in df1 to verify it was updated:
df1.select(F.expr("inline(foo)")).show()
#+-----+------+
#|value|value2|
#+-----+------+
#|   10|    30|
#+-----+------+
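For comparison with the format the question used, exploding df1 shows the same updated values:

df1.select(F.explode("foo").alias("fooColumn")).select("fooColumn.value", "fooColumn.value2").show()
#+-----+------+
#|value|value2|
#+-----+------+
#|   10|    30|
#+-----+------+

The advantage of transform over the explode-and-collapse route is that it rewrites the array in place, so there is no need to group the rows back together afterwards.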