Reputation: 11917
I have a PySpark DataFrame like this:
+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|        id|201806|201807|201808|201809|201810|201811|201812|201901|201902|201903|201904|201905|201906|
+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|         1|    15|    15|    15|    15|    15|    15|    15|    15|    15|  null|    15|    15|    15|
|         2|     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|
|         3|     7|     7|     7|     7|     7|     7|     7|     7|  null|  null|  null|  null|  null|
+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+
From this data I want to find the latest non-null value for each row.
I am expecting to get the following result.
+----------+------+
|        id|latest|
+----------+------+
|         1|    15|
|         2|     4|
|         3|     7|
+----------+------+
I followed this answer, but I am not able to do the operation per row.
I used:
from pyspark.sql.functions import last
df.select([last(x, ignorenulls=True).alias(x) for x in df.columns])
But this takes the last non-null value per column; I want the same operation done row-wise.
Upvotes: 0
Views: 866
Reputation: 663
Assuming your columns are ordered from oldest to latest, you can use the code below. It passes the columns to coalesce in reverse order, so the first non-null value coalesce returns is the latest one.
from pyspark.sql.functions import coalesce

# Latest month first, 'id' excluded; coalesce picks the first non-null value per row
df.select('id', coalesce(*[c for c in df.columns[::-1] if c != 'id']).alias('latest')).show()
Output:
+---+------+
| id|latest|
+---+------+
|  1|    15|
|  2|     4|
|  3|     7|
+---+------+
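If you want to sanity-check this, a minimal sketch along the following lines should reproduce the expected result (the SparkSession setup and the three-month sample data are my assumptions, not part of the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce

spark = SparkSession.builder.getOrCreate()

# Hypothetical three-month sample mirroring rows 1-3 of the question
df = spark.createDataFrame(
    [(1, 15, None, 15), (2, 4, 4, 4), (3, 7, None, None)],
    ["id", "201806", "201807", "201808"],
)

# Latest month first, id excluded; coalesce keeps the first non-null value per row
value_cols = [c for c in df.columns[::-1] if c != "id"]
df.select("id", coalesce(*value_cols).alias("latest")).show()

This prints 15, 4 and 7 for ids 1, 2 and 3 respectively. Note that a row whose month columns are all null would end up with null in the latest column.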
Upvotes: 2