Yadav

Reputation: 189

How to fetch a value from an array-of-structs DataFrame column after comparing a date attribute

Schema of the DataFrame:

root
    |-- parentColumn: array
    |    |-- element: struct
    |    |    |-- colA: string
    |    |    |-- colB: string
    |    |    |-- colTimestamp: string

The values inside the DataFrame look like this:

"parentColumn": [
        {
            "colA": "LatestValueA",
            "colB": "LatestValueB",
            "colTimestamp": "2020-08-18T04:00:44.986000"
        },
        {
            "colA": "OldValueA",
            "colB": "OldValueB",
            "colTimestamp": "2020-08-17T03:28:44.986000"
        }
    ]

I want to fetch the value of colA from the element with the latest colTimestamp. In the given scenario, LatestValueA should be returned because its colTimestamp is the latest.

I want to add this value as a new DataFrame column:

df.withColumn("newColumn", ?)

Upvotes: 1

Views: 45

Answers (1)

werner

Reputation: 14845

You can sort the array in descending order of colTimestamp and then take colA from the first element (the comparator form of array_sort requires Spark 3.0 or later):

from pyspark.sql import functions as F

# The comparator returns 1 when l is older than r, so newer structs sort first.
df.withColumn('sorted', F.expr("""array_sort(parentColumn, (l,r) -> case
          when l.colTimestamp < r.colTimestamp then 1
          when l.colTimestamp > r.colTimestamp then -1
          else 0 end)""")) \
  .withColumn('newColumn', F.col('sorted')[0].colA) \
  .show()
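
Note that colTimestamp is a string in the schema, so the comparator compares strings rather than real timestamps. That should be fine as long as every value uses the same ISO-8601 layout as in the question, because such strings sort in chronological order. Here is a small plain-Python sketch (no Spark involved) that mimics the same "sort descending, take the first" logic on the sample data; the names parent_column, by_string, by_datetime and new_column are only for illustration:

```python
from datetime import datetime

# Sample structs copied from the question, deliberately listed oldest-first.
parent_column = [
    {"colA": "OldValueA", "colB": "OldValueB",
     "colTimestamp": "2020-08-17T03:28:44.986000"},
    {"colA": "LatestValueA", "colB": "LatestValueB",
     "colTimestamp": "2020-08-18T04:00:44.986000"},
]

# Descending sort on the raw string, which is what the Spark comparator does.
by_string = sorted(parent_column,
                   key=lambda s: s["colTimestamp"], reverse=True)

# Descending sort on the parsed timestamp, used here only as a cross-check.
by_datetime = sorted(parent_column,
                     key=lambda s: datetime.fromisoformat(s["colTimestamp"]),
                     reverse=True)

# colA of the first (newest) element, i.e. the value wanted in newColumn.
new_column = by_string[0]["colA"]
print(new_column)
```

If the timestamps could arrive in mixed formats or with different time zone offsets, string ordering would no longer be reliable, and it would be safer to cast them to a timestamp type before comparing.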

Upvotes: 1
