Reputation: 19
How to calculate the join of two Dataframes using multiple columns as key? For example DF1 , DF2 are the two dataFrame.
This is the way by which we can calculate the join,
JoinDF = DF1.join(DF2, DF1("column1") === DF2("column11") && DF1("column2") === DF2("column22"), "outer")
But my problem is how to access the multiple columns if they are stored in an arrays like :
DF1KeyArray=Array{column1,column2}
DF2KeyArray=Array{column11,column22}
then It is not possible to calculate the join by this method
JoinDF = DF1.join(DF2, DF1(DF1KeyArray)=== DF2(DF2KeyArray), "outer")
In this case error was :
<console>:128: error: type mismatch;
found : Array[String]
required: String
Is there any way to access multiple columns as keys stored in an Array for calculation of join?
Upvotes: 0
Views: 6452
Reputation: 330113
You can simply create joinExprs
programmatically:
val df1KeyArray: Array[String] = ???
val df2KeyArray: Array[String] = ???
val df1: DataFrame = ???
val df2: DataFrame = ???
val joinExprs = df1KeyArray
.zip(df2KeyArray)
.map{case (c1, c2) => df1(c1) === df2(c2)}
.reduce(_ && _)
df1.join(df2, joinExprs, "outer")
See also Including null values in an Apache Spark Join
Upvotes: 10