Reputation: 8025
I am trying to combine 2 dataframes with the same column names to create a larger new dataframe, keeping all data. I use coalesce to replace values from one dataframe if the same id exists.
The join and coalesce turns out fine, but the joining column ("id") results in null
value when the id exist. Not sure why:
Example:
val tmp = Seq(
(1,"A"),
(2,"B"),
(3,"A"),
(4,"A")
).toDF("id","label")
val tmp2 = Seq(
(1,"B"),
(2,"B"),
(3,"B"),
(5, "A")
).toDF("id","label")
code:
import org.apache.spark.sql.functions._
// replace values in tmp(label) with tmp2(label) if both have same id.
val tmp3 = tmp.join(tmp2, tmp("id") === tmp2("id"), "fullouter")
.select(tmp("id"), coalesce(tmp2("label"), tmp("label")))
output:
+----+----------------------+
| id|coalesce(label, label)|
+----+----------------------+
|null| A|
| 1| B|
| 2| B|
| 3| B|
| 4| A|
+----+----------------------+
Wanted:
output:
+----+----------------------+
| id|coalesce(label, label)|
+----+----------------------+
| 5| A|
| 1| B|
| 2| B|
| 3| B|
| 4| A|
+----+----------------------+
changing the join to "full"
our "outer"
has the same result.
Upvotes: 2
Views: 2891
Reputation: 7928
You are performing a full outer join, so it will combine all the rows from both tables.
Since you are using the id
from tmp
as result id, when there is a row in tmp2
which is absent in tmp
it can't take a valid value so that's null
To get your expected output you can use another coalesce
in the first column, the same as you are doing in the second
val tmp3 = tmp.join(tmp2, tmp("id") === tmp2("id"), "fullouter")
.select(coalesce(tmp("id"), tmp2("id")), coalesce(tmp2("label"), tmp("label")))
+----------------+----------------------+
|coalesce(id, id)|coalesce(label, label)|
+----------------+----------------------+
| 1| B|
| 3| B|
| 5| A|
| 4| A|
| 2| B|
+----------------+----------------------+
Upvotes: 2