Reputation: 61
I have a dataframe with missing values within a row, and in Pandas I forward-fill them along the row with

```python
df.ffill(axis=1, inplace=True)
```

I want to understand the PySpark equivalent way to achieve this. I have read about Window functions, but those operate down a column (across rows), not across the columns of a single row.
Example:

Input:

| id | value1 | value2 | value3 | value4 | value5 |
|----|--------|--------|--------|--------|--------|
| A  | 2      | 3      | NaN    | NaN    | 6      |
| B  | 1      | NaN    | NaN    | NaN    | NaN    |
Output:

| id | value1 | value2 | value3 | value4 | value5 |
|----|--------|--------|--------|--------|--------|
| A  | 2      | 3      | 3      | 3      | 6      |
| B  | 1      | 1      | 1      | 1      | 1      |
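For reference, a small pandas sketch that reproduces the behaviour above (the frame is just the input table re-typed, with everything as floats):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'value1': [2.0, 1.0], 'value2': [3.0, np.nan], 'value3': [np.nan, np.nan],
     'value4': [np.nan, np.nan], 'value5': [6.0, np.nan]},
    index=['A', 'B'],
)

# forward-fill left to right within each row
df.ffill(axis=1, inplace=True)
print(df)
# A: 2.0, 3.0, 3.0, 3.0, 6.0
# B: 1.0, 1.0, 1.0, 1.0, 1.0
```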
Upvotes: 0
Views: 394
Reputation: 156
You can use `coalesce`: it takes the value from the `value3` column when it is not null, and otherwise falls back to the `value2` column.
```python
from pyspark.sql.functions import coalesce

df = df.withColumn('value3', coalesce('value3', 'value2'))
```
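On the sample data this fills row A's `value3` (null becomes 3, taken from `value2`), while row B's `value3` stays null because its `value2` is null as well; that is what the column-by-column loop below takes care of. A minimal sketch, assuming an active `SparkSession` named `spark` and hypothetical sample data:

```python
from pyspark.sql.functions import coalesce

df = spark.createDataFrame(
    [('A', 3, None), ('B', None, None)],
    'id string, value2 int, value3 int',
)

# coalesce returns its first non-null argument, left to right
df.withColumn('value3', coalesce('value3', 'value2')).show()
# A: value3 filled with 3 (from value2); B: still null, since value2 is null too
```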
To apply this to your whole dataset, simply loop over the value columns, like this:
```python
from pyspark.sql.functions import coalesce

# skip the id column, then fill each value column from the one to its left
cols = df.columns[1:]
for i in range(1, len(cols)):
    df = df.withColumn(cols[i], coalesce(cols[i], cols[i - 1]))
```
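Putting it together on the question's example: one thing to watch is that `coalesce` only skips null, not NaN, so if the columns really contain NaN (as the pandas example does) you need to convert NaN to null first. A runnable sketch, again assuming a `SparkSession` named `spark`:

```python
from pyspark.sql.functions import coalesce, col, isnan, when

df = spark.createDataFrame(
    [('A', 2.0, 3.0, float('nan'), float('nan'), 6.0),
     ('B', 1.0, float('nan'), float('nan'), float('nan'), float('nan'))],
    ['id', 'value1', 'value2', 'value3', 'value4', 'value5'],
)

value_cols = df.columns[1:]  # everything except 'id'

# coalesce only treats null as missing, so normalize NaN to null first
for c in value_cols:
    df = df.withColumn(c, when(isnan(col(c)), None).otherwise(col(c)))

# forward-fill left to right: each column falls back to the one before it
for i in range(1, len(value_cols)):
    df = df.withColumn(value_cols[i], coalesce(value_cols[i], value_cols[i - 1]))

df.show()
# A: 2.0, 3.0, 3.0, 3.0, 6.0
# B: 1.0, 1.0, 1.0, 1.0, 1.0
```

Because each `withColumn` builds on the previous one, the fill propagates all the way across the row, which is why row B ends up as all 1s.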
Upvotes: 1