Reputation: 95
There are two dataframes. For simplicity, I put them as follow:
DataFrame1
id | name
-----------
0 | Mike
1 | James
DataFrame2
id | name | salary
-------------------
0 | M | 10
1 | J | 20
2 | K | 30
I want to join the two DataFrame on id
and only keep the column name
in DataFrame1 while keeping the original one if there is no corresponding id
in DataFrame2.
It should be:
id | name | salary
--------------------
0 | Mike | 10
1 | James | 20
2 | K | 30
Till now, I only know how to join two dataframes by:
df1.join(df2, df1("id")===df2("id"), "left").select(df2("id"), df1("name"), df2("salary"))
But it will use null to ignore the name value "K".
Thanks!
Upvotes: 6
Views: 11808
Reputation: 29155
to replace null values you can use DataFrameNaFunctions like below...
df1.join(df2, df1("id")===df2("id"), "left_outer")
.select(df2("id"), df1("name"), df2("salary"))
.na.fill(ImmutableMap.of("name", "unknown")).show()
where 'unknown' is sample value. you can replace with value you wanted...
if you don't want rows with null value columns
val joined = df1.join(df2, df1("id")===df2("id"), "left_outer")
.select(df2("id"), df1("name"), df2("salary"))
val final = joined.where(joined.col("name").isNotNull)
final.show()
Also Note that as mentioned in @Tzach Zohar answer
def coalesce(e: Column*)
function
Returns the first column that is not null, or null if all inputs are null.
If you are looking for that kind of ... then you can go ahead.
Upvotes: 1
Reputation: 37822
You can use coalesce
, which returns the first column that isn't null
from the given columns. Plus - using left
join you should join df1
to df2
and not the other way around:
import org.apache.spark.sql.functions._
df2.join(df1, df1("id")===df2("id"), "left")
.select(df2("id"), coalesce(df1("name"), df2("name")), df2("salary"))
Upvotes: 13