Gun

Reputation: 179

How do I replace null values of multiple columns with values from multiple different columns

I have a data frame like below

data = [
(1, None,7,10,11,19),
(1, 4,None,10,43,58),
(None, 4,7,67,88,91),
(1, None,7,78,96,32)

]
df = spark.createDataFrame(data, ["A_min", "B_min","C_min","A_max", "B_max","C_max"])
df.show() 

and I want the null values in each '_min' column to be replaced with the value from the corresponding '_max' column. For example, null values in the A_min column should be replaced by the value in the A_max column, so the result looks like the data frame below.

+-----+-----+-----+-----+-----+-----+
|A_min|B_min|C_min|A_max|B_max|C_max|
+-----+-----+-----+-----+-----+-----+
|    1|   11|    7|   10|   11|   19|
|    1|    4|   58|   10|   43|   58|    
|   67|    4|    7|   67|   88|   91|
|    1|   96|    7|   78|   96|   32|
+-----+-----+-----+-----+-----+-----+  

I have tried the code below after defining the column lists, but it clearly does not work. I would really appreciate any help.

min_cols = ["A_min", "B_min", "C_min"]
max_cols = ["A_max", "B_max", "C_max"]

for i in min_cols:
    df = df.withColumn(i, when(f.col(i) == '', max_cols.otherwise(col(i))))
display(df)

Upvotes: 0

Views: 330

Answers (1)

noufel13

Reputation: 663

Assuming you have the same number of max and min columns, you can use coalesce along with a Python list comprehension to obtain your solution:

from pyspark.sql.functions import coalesce

min_cols = ["A_min", "B_min","C_min"]
max_cols = ["A_max", "B_max","C_max"]

df.select(
    *[coalesce(df[val], df[max_cols[pos]]).alias(val) for pos, val in enumerate(min_cols)],
    *max_cols
).show()

Output:

+-----+-----+-----+-----+-----+-----+
|A_min|B_min|C_min|A_max|B_max|C_max|
+-----+-----+-----+-----+-----+-----+
|    1|   11|    7|   10|   11|   19|
|    1|    4|   58|   10|   43|   58|
|   67|    4|    7|   67|   88|   91|
|    1|   96|    7|   78|   96|   32|
+-----+-----+-----+-----+-----+-----+
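
The same idea can also be written with zip, pairing each min column with its max counterpart by position instead of looking it up by index. This is just a sketch of an equivalent formulation, assuming the two lists are in matching order:

from pyspark.sql.functions import coalesce

min_cols = ["A_min", "B_min", "C_min"]
max_cols = ["A_max", "B_max", "C_max"]

# coalesce returns the first non-null value, so each *_min column falls back
# to its *_max counterpart whenever the *_min value is null
df.select(
    *[coalesce(df[mn], df[mx]).alias(mn) for mn, mx in zip(min_cols, max_cols)],
    *max_cols
).show()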

Upvotes: 3
