I have the following schema:
>>> df.printSchema()
root
... SNIP ...
 |-- foo: array (nullable = true)
 |    |-- element: struct (containsNull = true)
... SNIP ...
 |    |    |-- value: double (nullable = true)
 |    |    |-- value2: double (nullable = true)
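For reference, a minimal sketch that reproduces just the foo column of this schema (the snipped fields are left out, and the single all-null element is assumed from the output shown below):

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DoubleType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Only the foo column from the (snipped) schema is rebuilt here.
schema = StructType([
    StructField("foo", ArrayType(StructType([
        StructField("value", DoubleType(), True),
        StructField("value2", DoubleType(), True),
    ]), containsNull=True), True),
])

# One row whose foo array holds a single struct with both fields null.
df = spark.createDataFrame([([(None, None)],)], schema)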
In this case, I only have one row in the dataframe and in the foo array:
>>> df.count()
1
>>> df.select(explode('foo').alias("fooColumn")).count()
1
value is null:
>>> df.select(explode('foo').alias("fooColumn")).select('fooColumn.value','fooColumn.value2').show()
+-----+------+
|value|value2|
+-----+------+
| null|  null|
+-----+------+
I want to edit value and make a new dataframe. I can explode foo and set value:
>>> fooUpdated = df.select(explode("foo").alias("fooColumn")).select("fooColumn.*").withColumn('value', lit(10)).select('value')
>>> fooUpdated.show()
+-----+
|value|
+-----+
|   10|
+-----+
How do I collapse this dataframe to put fooUpdated back in as an array of structs, or is there a way to do this without exploding foo?
In the end, I want to have the following:
>>> dfUpdated.select(explode('foo').alias("fooColumn")).select('fooColumn.value', 'fooColumn.value2').show()
+-----+------+
|value|value2|
+-----+------+
|   10|  null|
+-----+------+
You can use the transform higher-order function (available in Spark SQL since 2.4) to update each struct in the foo array. Here's an example:
import pyspark.sql.functions as F
df.printSchema()
#root
# |-- foo: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- value: string (nullable = true)
# |    |    |-- value2: long (nullable = true)
df1 = df.withColumn(
    "foo",
    F.expr("transform(foo, x -> struct(coalesce(x.value, 10) as value, x.value2 as value2))")
)
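If you are on Spark 3.1 or later, the same transform is also exposed in the Python API, so a sketch without the SQL expression string (assuming the same df) would be:

import pyspark.sql.functions as F

# Same idea via pyspark.sql.functions.transform (Spark 3.1+):
# rewrite each struct, replacing a null value with 10.
df1 = df.withColumn(
    "foo",
    F.transform(
        "foo",
        lambda x: F.struct(
            F.coalesce(x["value"], F.lit(10)).alias("value"),
            x["value2"].alias("value2"),
        ),
    ),
)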
Now, you can show the value in df1 to verify it was updated:
df1.select(F.expr("inline(foo)")).show()
#+-----+------+
#|value|value2|
#+-----+------+
#|   10|    30|
#+-----+------+
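For comparison with the format the question used, exploding df1 shows the same updated values:

df1.select(F.explode("foo").alias("fooColumn")).select("fooColumn.value", "fooColumn.value2").show()
#+-----+------+
#|value|value2|
#+-----+------+
#|   10|    30|
#+-----+------+

The advantage of transform over the explode-and-collapse route is that it rewrites the array in place, so there is no need to group the rows back together afterwards.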