Reputation: 139
I have a PySpark dataframe with the following schema:
|-- suborders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- trackingStatusHistory: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- trackingStatusUpdatedAt: string (nullable = true)
| | | | |-- trackingStatus: string (nullable = true)
What I want to do is create a new deliveredAt field in each struct of the suborders array, based on a condition. I need to find the entry in the trackingStatusHistory array where trackingStatusHistory.trackingStatus = 'delivered'. If such an entry exists, the new deliveredAt field should receive the date from trackingStatusHistory.trackingStatusUpdatedAt; if it doesn't exist, it should receive null.
How can I do this using pyspark?
Upvotes: 2
Views: 1409
Reputation: 32710
You can do that using the higher-order functions transform + filter on arrays. For each struct element of the suborders array, you add a new field by filtering the sub-array trackingStatusHistory and taking the delivery date, like this:
import pyspark.sql.functions as F

df = df.withColumn(
    "suborders",
    # Rebuild each suborder struct with an extra deliveredAt field:
    # filter trackingStatusHistory down to the 'delivered' entry and
    # take its trackingStatusUpdatedAt ([0] yields null if none matched)
    F.expr("""transform(
        suborders,
        x -> struct(
            filter(x.trackingStatusHistory, y -> y.trackingStatus = 'delivered')[0].trackingStatusUpdatedAt as deliveredAt,
            x.trackingStatusHistory as trackingStatusHistory
        )
    )""")
)
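Note that when no 'delivered' status is present, filter returns an empty array and the [0] access yields null, which gives you the null deliveredAt you asked for.

If you're on Spark 3.1+, the same logic can also be written with the Python column functions transform / filter and withField instead of a SQL expression. A sketch of that equivalent version:

from pyspark.sql import functions as F

# Spark 3.1+ only: transform/filter/withField exist as Python APIs
df = df.withColumn(
    "suborders",
    F.transform(
        "suborders",
        lambda x: x.withField(
            "deliveredAt",
            # [0] on the filtered array yields null when nothing matched
            F.filter(
                x["trackingStatusHistory"],
                lambda y: y["trackingStatus"] == "delivered",
            )[0]["trackingStatusUpdatedAt"],
        ),
    ),
)

Unlike the struct(...) rebuild above, withField keeps any other fields of each suborder struct intact, which matters if your real structs contain more than trackingStatusHistory.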
Upvotes: 2