Manish Tripathi

Reputation: 61

Pandas ffill() equivalent in PySpark

I have a dataframe with missing values within each row, and in Pandas I use df.ffill(axis=1, inplace=True) to forward-fill them across the columns.

I want to understand the equivalent way to achieve this in PySpark. I have read about using Window functions, but those fill down a column (across rows) rather than across the columns of a single row.

Example :

Input :

id  value1  value2  value3  value4  value5
A   2       3       NaN     NaN     6
B   1       NaN     NaN     NaN     NaN

Output :

id  value1  value2  value3  value4  value5
A   2       3       3       3       6
B   1       1       1       1       1
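
For reference, a minimal sketch of how this sample data could be created as a PySpark DataFrame (None stands in for NaN; the explicit schema is only there because the all-null columns cannot be type-inferred):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Explicit schema: value3 and value4 are entirely null, so Spark cannot infer their type.
schema = StructType(
    [StructField("id", StringType(), True)]
    + [StructField(f"value{i}", IntegerType(), True) for i in range(1, 6)]
)

# None plays the role of NaN from the pandas frame.
df = spark.createDataFrame(
    [("A", 2, 3, None, None, 6),
     ("B", 1, None, None, None, None)],
    schema,
)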

Upvotes: 0

Views: 394

Answers (1)

seghair tarek

Reputation: 156

You can use coalesce: it will take the value from the value3 column if it is not null, and otherwise from the value2 column.

from pyspark.sql.functions import coalesce

df = df.withColumn('value3', coalesce('value3', 'value2'))
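
Applied to the example above, row A's value3 becomes 3 (pulled from value2), while row B's value3 stays null for now because its value2 is also null.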

To apply this across the whole dataframe, loop over the columns, like this:

from pyspark.sql.functions import coalesce

cols = df.columns
# Start at index 2 so each value column is filled from the previous value
# column; index 1 (value1) has nothing to its left, and the id column at
# index 0 should never be used as a fill source.
for i in range(2, len(cols)):
    df = df.withColumn(cols[i], coalesce(cols[i], cols[i - 1]))
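
If the dataframe has many columns, the same idea can be expressed as a single select instead of chaining one withColumn per column. This is just a sketch under the same assumption that the first column is the id and should be left untouched:

from pyspark.sql import functions as F

value_cols = df.columns[1:]  # every column except id

# Each output column coalesces over itself and all columns to its left,
# which is a forward fill along the row (like pandas ffill(axis=1)).
filled = [
    F.coalesce(*[F.col(c) for c in reversed(value_cols[:i + 1])]).alias(value_cols[i])
    for i in range(len(value_cols))
]

df_filled = df.select("id", *filled)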

Upvotes: 1
