abc_spark

Reputation: 383

How to get the join key in a left join in Apache Spark

I have the DataFrames below:

scala> val df1 = Seq( ("1","1_10"), ("1","1_11"), ("2","2_20"), ("3","3_30"), ("3","3_31") ).toDF("c1","c2")

+---+----+
| c1|  c2|
+---+----+
|  1|1_10|
|  1|1_11|
|  2|2_20|
|  3|3_30|
|  3|3_31|
+---+----+

scala> val df2 = Seq( ("2","200"), ("3","300") ).toDF("c1","val")

+---+---+
| c1|val|
+---+---+
|  2|200|
|  3|300|
+---+---+

If I do a left join, I get the result below:

scala> df1.join(df2,Seq("c1"),"left").select(df1("c1").alias("df1_c1"),df1("c2"),df2("val")).show
+------+----+----+
|df1_c1|  c2| val|
+------+----+----+
|     1|1_10|null|
|     1|1_11|null|
|     2|2_20| 200|
|     3|3_30| 300|
|     3|3_31| 300|
+------+----+----+

But how can I also get the value of the join key (c1) from the right table?

Expected output:

+------+----+----+------+
|df1_c1|  c2| val|df2_c1|
+------+----+----+------+
|     1|1_10|null|  null|
|     1|1_11|null|  null|
|     2|2_20| 200|     2|
|     3|3_30| 300|     3|
|     3|3_31| 300|     3|
+------+----+----+------+

If I try df1.join(df2,Seq("c1"),"left").select(df1("c1").alias("df1_c1"),df1("c2"),df2("val"),df2("c1")).show, I get the error below:

org.apache.spark.sql.AnalysisException: Resolved attribute(s) c1#19639 missing from c1#19630,c2#19631,val#19640 in operator !Project [c1#19630 AS df1_c1#19667, c2#19631, val#19640, c1#19639]. Attribute(s) with the same name appear in the operation: c1. Please check if the right attribute(s) are used.;
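
Inspecting the schema of the joined DataFrame suggests why: after joining on Seq("c1"), only a single c1 column survives, so df2("c1") no longer resolves against the joined plan. A minimal check, using the same df1 and df2 as above:

scala> df1.join(df2, Seq("c1"), "left").printSchema()
root
 |-- c1: string (nullable = true)
 |-- c2: string (nullable = true)
 |-- val: string (nullable = true)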

Upvotes: 0

Views: 357

Answers (1)

falcon-le0

Reputation: 609

Spark drops the duplicate join column when you pass a sequence of column names (Seq("c1")) to join. You can use a custom join expression instead:

df1.as("df1").join(df2.as("df2"), expr("df1.c1 == df2.c1"),"left").select($"df1.c1".alias("df1_c1"), $"df1.c2", $"df2.c1".as("df2_c1"), $"df2.val").show(false)

Upvotes: 1
