Spark scala full join outputs null on joining column

Question

I am trying to combine 2 dataframes with the same column names to create a larger new dataframe, keeping all data. I use coalesce to replace values from one dataframe if the same id exists.

The join and coalesce turns out fine, but the joining column ("id") results in null value when the id exist. Not sure why:

Example:

val tmp = Seq(
  (1,"A"),
  (2,"B"),
  (3,"A"),
  (4,"A")
).toDF("id","label")


val tmp2 = Seq(
  (1,"B"),
  (2,"B"),
  (3,"B"),
  (5, "A")
).toDF("id","label")

code:

import org.apache.spark.sql.functions._

// replace values in tmp(label) with tmp2(label) if both have same id.
val tmp3 = tmp.join(tmp2, tmp("id") === tmp2("id"), "fullouter")
           .select(tmp("id"), coalesce(tmp2("label"), tmp("label")))

output:
+----+----------------------+
|  id|coalesce(label, label)|
+----+----------------------+
|null|                     A|
|   1|                     B|
|   2|                     B|
|   3|                     B|
|   4|                     A|
+----+----------------------+

Wanted:

output:
+----+----------------------+
|  id|coalesce(label, label)|
+----+----------------------+
|   5|                     A|
|   1|                     B|
|   2|                     B|
|   3|                     B|
|   4|                     A|
+----+----------------------+

changing the join to "full" our "outer" has the same result.

SCouto · Accepted Answer

You are performing a full outer join, so it will combine all the rows from both tables.

Since you are using the id from tmp as result id, when there is a row in tmp2 which is absent in tmp it can't take a valid value so that's null

To get your expected output you can use another coalesce in the first column, the same as you are doing in the second

val tmp3 = tmp.join(tmp2, tmp("id") === tmp2("id"), "fullouter")
.select(coalesce(tmp("id"), tmp2("id")), coalesce(tmp2("label"), tmp("label")))
+----------------+----------------------+
|coalesce(id, id)|coalesce(label, label)|
+----------------+----------------------+
|               1|                     B|
|               3|                     B|
|               5|                     A|
|               4|                     A|
|               2|                     B|
+----------------+----------------------+

Spark scala full join outputs null on joining column

Answers (1)

Related Questions