Sreeram TP

Reputation: 11917

Find latest non null value for each row in PySpark

I have a PySpark dataframe like this,

+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|id        |201806|201807|201808|201809|201810|201811|201812|201901|201902|201903|201904|201905|201906|
+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|  1       |    15|    15|    15|    15|    15|    15|    15|    15|    15|  null|    15|    15|    15|
|  2       |     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|
|  3       |     7|     7|     7|     7|     7|     7|     7|     7|  null|  null|  null|  null|  null|
+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+
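
For anyone who wants to reproduce this, a minimal sketch that builds an equivalent dataframe (the SparkSession setup is assumed; the values are copied from the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Month columns in chronological (oldest-to-newest) order.
months = [str(m) for m in [201806, 201807, 201808, 201809, 201810, 201811,
                           201812, 201901, 201902, 201903, 201904, 201905, 201906]]
rows = [
    (1, 15, 15, 15, 15, 15, 15, 15, 15, 15, None, 15, 15, 15),
    (2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4),
    (3, 7, 7, 7, 7, 7, 7, 7, 7, None, None, None, None, None),
]
df = spark.createDataFrame(rows, ['id'] + months)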

From this data I want to find the latest non-null value for each row.

I am expecting to get the following result.

+----------+------+
|id        |latest|
+----------+------+
|  1       |    15|
|  2       |     4|
|  3       |     7|
+----------+------+

I followed this answer but I am not able to do the operation per row.

I used,

from pyspark.sql.functions import last

df.select([last(x, ignorenulls=True).alias(x) for x in df.columns])

But last is an aggregate: it takes the last non-null value down each column and collapses the dataframe to a single row. I want the same operation done across the columns of each row.

Upvotes: 0

Views: 866

Answers (1)

noufel13

Reputation: 663

Assuming your columns are ordered from oldest to newest, you can use the code below, which reverses the column list and uses coalesce to pick the latest non-null value in each row.

from pyspark.sql.functions import coalesce

# Reverse the columns (newest first) and let coalesce return the
# first non-null value it finds in each row.
month_cols = [c for c in df.columns[::-1] if c != 'id']
df.select('id', coalesce(*month_cols).alias('latest')).show()

Output:

+---+------+
| id|latest|
+---+------+
|  1|    15|
|  2|     4|
|  3|     7|
+---+------+
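
This works because coalesce returns its first non-null argument: with the month columns passed in reverse (newest-first) order, each row evaluates to the most recent non-null value. For id 3, for example, the null 201902-201906 columns are skipped and 7 is taken from 201901.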

Upvotes: 2
