Reputation: 11917
I have a PySpark DataFrame like this:
+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|        id|201806|201807|201808|201809|201810|201811|201812|201901|201902|201903|201904|201905|201906|
+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|         1|    15|    15|    15|    15|    15|    15|    15|    15|    15|  null|    15|    15|    15|
|         2|     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|
|         3|     7|     7|     7|     7|     7|     7|     7|     7|  null|  null|  null|  null|  null|
+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+
From this data I want to find the latest non-null value for each row.
I am expecting to get the following result.
+----------+------+
|        id|latest|
+----------+------+
|         1|    15|
|         2|     4|
|         3|     7|
+----------+------+
I followed this answer, but I am not able to do the operation per row.
I used:
from pyspark.sql.functions import last
df.select([last(x, ignorenulls=True).alias(x) for x in df.columns])
But this takes the last non-null value per column; I want the same operation done row-wise.
Upvotes: 0
Views: 866
Reputation: 663
Assuming your columns are ordered from oldest to latest, you can use the code below. It passes the columns to coalesce in reverse order, so the first non-null value coalesce returns is the latest one.
from pyspark.sql.functions import coalesce

# Latest month first, 'id' excluded; coalesce picks the first non-null value per row
df.select('id', coalesce(*[c for c in df.columns[::-1] if c != 'id']).alias('latest')).show()
Output:
+---+------+
| id|latest|
+---+------+
|  1|    15|
|  2|     4|
|  3|     7|
+---+------+
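If you want to sanity-check this, a minimal sketch along the following lines should reproduce the expected result (the SparkSession setup and the three-month sample data are my assumptions, not part of the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce

spark = SparkSession.builder.getOrCreate()

# Hypothetical three-month sample mirroring rows 1-3 of the question
df = spark.createDataFrame(
    [(1, 15, None, 15), (2, 4, 4, 4), (3, 7, None, None)],
    ["id", "201806", "201807", "201808"],
)

# Latest month first, id excluded; coalesce keeps the first non-null value per row
value_cols = [c for c in df.columns[::-1] if c != "id"]
df.select("id", coalesce(*value_cols).alias("latest")).show()

This prints 15, 4 and 7 for ids 1, 2 and 3 respectively. Note that a row whose month columns are all null would end up with null in the latest column.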
Upvotes: 2