How do I replace null values of multiple columns with values from multiple different columns

Question

I have a data frame like below

data = [
(1, None,7,10,11,19),
(1, 4,None,10,43,58),
(None, 4,7,67,88,91),
(1, None,7,78,96,32)

]
df = spark.createDataFrame(data, ["A_min", "B_min","C_min","A_max", "B_max","C_max"])
df.show()

and I would want the columns which show name as 'min' to be replaced by their equivalent max column. Example null values of A_min column should be replaced by A_max column It should be like the data frame below.

+-----+-----+-----+-----+-----+-----+
|A_min|B_min|C_min|A_max|B_max|C_max|
+-----+-----+-----+-----+-----+-----+
|    1|   11|    7|   10|   11|   19|
|    1|    4|   58|   10|   43|   58|    
|   67|    4|    7|   67|   88|   91|
|    1|   96|    7|   78|   96|   32|
+-----+-----+-----+-----+-----+-----+

I have tried the code below by defining the columns but clearly this does not work. Really appreciate any help.

min_cols = ["A_min", "B_min","C_min"]
max_cols = ["A_max", "B_max","C_max"]

for i in min_cols 
df = df.withColumn(i,when(f.col(i)=='',max_cols.otherwise(col(i))))
display(df)

noufel13 · Accepted Answer

Assuming you have the same number of max and min columns, you can use coalesce along with python's list comprehension to obtain your solution

from pyspark.sql.functions import coalesce

min_cols = ["A_min", "B_min","C_min"]
max_cols = ["A_max", "B_max","C_max"]

df.select(*[coalesce(df[val], df[max_cols[pos]]).alias(val) for pos, val in enumerate(min_cols)], *max_cols).show()

Output:

+-----+-----+-----+-----+-----+-----+
|A_min|B_min|C_min|A_max|B_max|C_max|
+-----+-----+-----+-----+-----+-----+
|    1|   11|    7|   10|   11|   19|
|    1|    4|   58|   10|   43|   58|
|   67|    4|    7|   67|   88|   91|
|    1|   96|    7|   78|   96|   32|
+-----+-----+-----+-----+-----+-----+

How do I replace null values of multiple columns with values from multiple different columns

Answers (1)

Related Questions