Reputation: 61
I have a dataframe with missing values within a row, and in Pandas I forward-fill them along the row with

```python
df.ffill(axis=1, inplace=True)
```

I want to understand the PySpark equivalent way to achieve this. I have read about Window functions, but those operate down a column (across rows), not across the columns of a single row.
Example:

Input:

| id | value1 | value2 | value3 | value4 | value5 |
|----|--------|--------|--------|--------|--------|
| A  | 2      | 3      | NaN    | NaN    | 6      |
| B  | 1      | NaN    | NaN    | NaN    | NaN    |
Output:

| id | value1 | value2 | value3 | value4 | value5 |
|----|--------|--------|--------|--------|--------|
| A  | 2      | 3      | 3      | 3      | 6      |
| B  | 1      | 1      | 1      | 1      | 1      |
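For reference, a small pandas sketch that reproduces the behaviour above (the frame is just the input table re-typed, with everything as floats):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'value1': [2.0, 1.0], 'value2': [3.0, np.nan], 'value3': [np.nan, np.nan],
     'value4': [np.nan, np.nan], 'value5': [6.0, np.nan]},
    index=['A', 'B'],
)

# forward-fill left to right within each row
df.ffill(axis=1, inplace=True)
print(df)
# A: 2.0, 3.0, 3.0, 3.0, 6.0
# B: 1.0, 1.0, 1.0, 1.0, 1.0
```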
Upvotes: 0
Views: 394
Reputation: 156
You can use `coalesce`: it takes the value from the `value3` column when it is not null, and otherwise falls back to the `value2` column.
```python
from pyspark.sql.functions import coalesce

df = df.withColumn('value3', coalesce('value3', 'value2'))
```
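On the sample data this fills row A's `value3` (null becomes 3, taken from `value2`), while row B's `value3` stays null because its `value2` is null as well; that is what the column-by-column loop below takes care of. A minimal sketch, assuming an active `SparkSession` named `spark` and hypothetical sample data:

```python
from pyspark.sql.functions import coalesce

df = spark.createDataFrame(
    [('A', 3, None), ('B', None, None)],
    'id string, value2 int, value3 int',
)

# coalesce returns its first non-null argument, left to right
df.withColumn('value3', coalesce('value3', 'value2')).show()
# A: value3 filled with 3 (from value2); B: still null, since value2 is null too
```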
To apply this to your whole dataset, simply loop over the value columns, like this:
```python
from pyspark.sql.functions import coalesce

# skip the id column, then fill each value column from the one to its left
cols = df.columns[1:]
for i in range(1, len(cols)):
    df = df.withColumn(cols[i], coalesce(cols[i], cols[i - 1]))
```
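Putting it together on the question's example: one thing to watch is that `coalesce` only skips null, not NaN, so if the columns really contain NaN (as the pandas example does) you need to convert NaN to null first. A runnable sketch, again assuming a `SparkSession` named `spark`:

```python
from pyspark.sql.functions import coalesce, col, isnan, when

df = spark.createDataFrame(
    [('A', 2.0, 3.0, float('nan'), float('nan'), 6.0),
     ('B', 1.0, float('nan'), float('nan'), float('nan'), float('nan'))],
    ['id', 'value1', 'value2', 'value3', 'value4', 'value5'],
)

value_cols = df.columns[1:]  # everything except 'id'

# coalesce only treats null as missing, so normalize NaN to null first
for c in value_cols:
    df = df.withColumn(c, when(isnan(col(c)), None).otherwise(col(c)))

# forward-fill left to right: each column falls back to the one before it
for i in range(1, len(value_cols)):
    df = df.withColumn(value_cols[i], coalesce(value_cols[i], value_cols[i - 1]))

df.show()
# A: 2.0, 3.0, 3.0, 3.0, 6.0
# B: 1.0, 1.0, 1.0, 1.0, 1.0
```

Because each `withColumn` builds on the previous one, the fill propagates all the way across the row, which is why row B ends up as all 1s.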
Upvotes: 1