Reputation: 226
I have two dataframes. Their sets of columns differ slightly.
df1:
+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
| 1| 15| 20| 8|
| 2| 0|null| 5|
+---+----+----+----+
df2:
+---+----+----+----+
| id|col1|col2|col4|
+---+----+----+----+
| 1| 10| 10| 40|
| 2| 10| 30| 50|
+---+----+----+----+
How can I make a left join on df1 in PySpark, while at the same time replacing null values with the values from df2 and adding the columns that are missing from df1?
result_df:
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  1|  15|  20|   8|  40|
|  2|   0|  30|   5|  50|
+---+----+----+----+----+
I need to join the two dataframes on id to get the extra column col4, and for col1, col2, col3 take the values from df1, unless a value is null, in which case it should be replaced with the value from df2.
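For reference, the two input frames can be reproduced with a few lines (a minimal sketch, assuming an existing SparkSession named spark):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; None becomes null in Spark.
df1 = spark.createDataFrame(
    [(1, 15, 20, 8), (2, 0, None, 5)],
    ["id", "col1", "col2", "col3"],
)
df2 = spark.createDataFrame(
    [(1, 10, 10, 40), (2, 10, 30, 50)],
    ["id", "col1", "col2", "col4"],
)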
Upvotes: 1
Views: 779
Reputation: 31460
Use the coalesce function after the left join; coalesce returns the first non-null value among its arguments.
from pyspark.sql.functions import coalesce
df1.show()
#+---+----+----+----+
#| id|col1|col2|col3|
#+---+----+----+----+
#| 1| 15| 20| 8|
#| 2| 0|null| 5|
#+---+----+----+----+
df2.show()
#+---+----+----+----+
#| id|col1|col2|col4|
#+---+----+----+----+
#|  1|  10|  10|  40|
#|  2|  10|  30|  50|
#+---+----+----+----+
# Take df1's value first and fall back to df2 when it is null;
# col3 exists only in df1 and col4 only in df2, so pass them through.
df1.join(df2, ["id"], "left").\
    select("id",
           coalesce(df1.col1, df2.col1).alias("col1"),
           coalesce(df1.col2, df2.col2).alias("col2"),
           df1.col3,
           df2.col4).\
    show()
#+---+----+----+----+----+
#| id|col1|col2|col3|col4|
#+---+----+----+----+----+
#|  1|  15|  20|   8|  40|
#|  2|   0|  30|   5|  50|
#+---+----+----+----+----+
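If the column lists are not fixed, the select can also be built dynamically: coalesce every column the two frames share and pass the rest through from whichever frame has them. A sketch, assuming the same df1 and df2 as above:
from pyspark.sql.functions import coalesce

# Columns present in both frames (besides the join key) get coalesced,
# df1 first; columns unique to either frame are passed through as-is.
shared = [c for c in df1.columns if c in df2.columns and c != "id"]
only_df1 = [c for c in df1.columns if c not in df2.columns]
only_df2 = [c for c in df2.columns if c not in df1.columns]

result = df1.join(df2, ["id"], "left").select(
    "id",
    *[coalesce(df1[c], df2[c]).alias(c) for c in shared],
    *[df1[c] for c in only_df1],
    *[df2[c] for c in only_df2],
)
result.show()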
Upvotes: 2