How to join two DataFrames and replace one column conditionally in Spark

Question

There are two DataFrames. For simplicity, they look as follows:

DataFrame1

id | name
-----------
0  | Mike
1  | James

DataFrame2

id | name | salary
-------------------
0  | M    | 10
1  | J    | 20
2  | K    | 30

I want to join the two DataFrames on id and use the name column from DataFrame1, while keeping the original name from DataFrame2 when there is no corresponding id in DataFrame1.

It should be:

id | name  | salary
--------------------
0  | Mike  |  10
1  | James |  20
2  | K     |  30

So far, I only know how to join the two DataFrames with:

df1.join(df2, df1("id")===df2("id"), "left").select(df2("id"), df1("name"), df2("salary"))

But this does not keep the name value "K".

Thanks!

Tzach Zohar · Accepted Answer

You can use coalesce, which returns the first non-null value from the given columns. Also, with a left join, df2 should be on the left side (df2.join(df1, ...)) and not the other way around, so that every row of df2 is kept:

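import org.apache.spark.sql.functions._

// df2 is the left side, so ids that exist only in df2 survive the join
df2.join(df1, df1("id") === df2("id"), "left")
  // prefer df1's name; fall back to df2's name when df1 has no match
  .select(df2("id"), coalesce(df1("name"), df2("name")).as("name"), df2("salary"))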
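
For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the two sample DataFrames from the question and applies the same join. The object name JoinWithFallback, the helper joinWithFallback, and the local SparkSession settings are assumptions made for illustration and are not part of the answer. The merged column is aliased to name so the output header matches the expected table.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.coalesce

object JoinWithFallback {
  // Hypothetical helper: keep every row of df2, take name from df1 when the id matches,
  // otherwise fall back to df2's own name.
  def joinWithFallback(df1: DataFrame, df2: DataFrame): DataFrame =
    df2.join(df1, df2("id") === df1("id"), "left")
      .select(
        df2("id"),
        coalesce(df1("name"), df2("name")).as("name"),
        df2("salary")
      )

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings)
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-with-fallback")
      .getOrCreate()
    import spark.implicits._

    // Sample data copied from the question
    val df1 = Seq((0, "Mike"), (1, "James")).toDF("id", "name")
    val df2 = Seq((0, "M", 10), (1, "J", 20), (2, "K", 30)).toDF("id", "name", "salary")

    joinWithFallback(df1, df2).orderBy("id").show()

    spark.stop()
  }
}
```

With the sample data, the rows for ids 0 and 1 should carry the names from df1 (Mike, James), and the row for id 2 should keep K from df2, since df1 has no matching id there.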
