Reputation: 111
It seems this question has been asked a couple of times, but the solutions suggested in previous questions are not working for me.
I have two dataframes with different dimensions, as shown in the picture below. The second table (second) was originally part of the first table (first), but after some processing on it I added one more column, column4. Now I want to join these two tables so that I get the third table (Required) after joining.
Things I tried:
I went through a couple of different solutions, but none of them worked for me. First, I tried
val required =first.join(second, first("PDE_HDR_CMS_RCD_NUM") === second("PDE_HDR_CMS_RCD_NUM") , "left_outer")
I also tried
val required = first.withColumn("SEQ", when(second.col("PDE_HDR_FILE_ID") === (first.col("PDE_HDR_FILE_ID").alias("PDE_HDR_FILE_ID1")), second.col("uniqueID")).otherwise(lit(0)))
In the second attempt I used .alias after I got an error that says
Error occured during extract process. Error: org.apache.spark.sql.AnalysisException: Resolved attribute(s) uniqueID#775L missing from.
Thanks for taking the time to read my question.
Upvotes: 3
Views: 1568
Reputation: 22439
To generate the wanted result, you should join the two tables on the column(s) that uniquely identify rows in your first table. Assuming c1 + c2 + c3 uniquely identifies each row in the first table, here's an example using a partial set of your sample data:
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = Seq(
  (1, "e", "o"),
  (4, "d", "t"),
  (3, "f", "e"),
  (2, "r", "r"),
  (6, "y", "f"),
  (5, "t", "g"),
  (1, "g", "h"),
  (4, "f", "j"),
  (6, "d", "k"),
  (7, "s", "o")
).toDF("c1", "c2", "c3")

val df2 = Seq(
  (3, "f", "e", 444),
  (5, "t", "g", 555),
  (7, "s", "o", 666)
).toDF("c1", "c2", "c3", "c4")
df1.join(df2, Seq("c1", "c2", "c3"), "left_outer").show
// +---+---+---+----+
// | c1| c2| c3| c4|
// +---+---+---+----+
// | 1| e| o|null|
// | 4| d| t|null|
// | 3| f| e| 444|
// | 2| r| r|null|
// | 6| y| f|null|
// | 5| t| g| 555|
// | 1| g| h|null|
// | 4| f| j|null|
// | 6| d| k|null|
// | 7| s| o| 666|
// +---+---+---+----+
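Incidentally, the AnalysisException in your second attempt happens because withColumn can only reference columns of the DataFrame it is called on; columns from another DataFrame have to be brought in via a join first. If you want 0 instead of null in the new column (as in your original otherwise(lit(0)) attempt), a minimal sketch — assuming the joined column is named c4 as above — is to fill the nulls after the join:

```scala
// After the left_outer join, replace nulls in c4 with 0
// (this mirrors the otherwise(lit(0)) intent from the question).
val joined = df1.join(df2, Seq("c1", "c2", "c3"), "left_outer")
joined.na.fill(0, Seq("c4")).show
// rows without a match in df2 now show 0 in c4 instead of null
```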
Upvotes: 1