N.Mittal

Reputation: 19

Join of two Dataframes using multiple columns as keys stored in an Array in Apache Spark

How do I compute the join of two DataFrames using multiple columns as the key? For example, DF1 and DF2 are the two DataFrames.

This is the way the join can be computed when the key columns are referenced directly:

JoinDF = DF1.join(DF2, DF1("column1") === DF2("column11") && DF1("column2") === DF2("column22"), "outer") 

But my problem is how to access the multiple columns if they are stored in arrays like:

val DF1KeyArray = Array("column1", "column2")
val DF2KeyArray = Array("column11", "column22")

Then it is not possible to compute the join with this method:

JoinDF = DF1.join(DF2, DF1(DF1KeyArray)=== DF2(DF2KeyArray), "outer")

In this case the error is:

<console>:128: error: type mismatch;
found   : Array[String]
required: String

Is there any way to access multiple key columns stored in an array when computing the join?

Upvotes: 0

Views: 6452

Answers (1)

zero323

Reputation: 330113

You can simply create joinExprs programmatically:

val df1KeyArray: Array[String] = ???
val df2KeyArray: Array[String] = ???

val df1: DataFrame = ???
val df2: DataFrame = ???

val joinExprs = df1KeyArray
  .zip(df2KeyArray)
  .map{case (c1, c2) => df1(c1) === df2(c2)}
  .reduce(_ && _)

df1.join(df2, joinExprs, "outer")
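For illustration, here is a minimal sketch of the same pattern with concrete toy DataFrames (the column and variable names are hypothetical, and a `SparkSession` named `spark` is assumed to be in scope):

val spark: SparkSession = ???
import spark.implicits._

val df1 = Seq((1, "a", 10), (2, "b", 20)).toDF("column1", "column2", "v1")
val df2 = Seq((1, "a", 100), (3, "c", 300)).toDF("column11", "column22", "v2")

val df1KeyArray = Array("column1", "column2")
val df2KeyArray = Array("column11", "column22")

// Pair up the key columns, build one equality condition per pair,
// and fold them into a single predicate with &&.
val joinExprs = df1KeyArray
  .zip(df2KeyArray)
  .map { case (c1, c2) => df1(c1) === df2(c2) }
  .reduce(_ && _)

df1.join(df2, joinExprs, "outer").show()

Note that `reduce` throws on an empty collection, so this assumes at least one key pair; if the arrays might be empty, guard for that case first.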

See also Including null values in an Apache Spark Join

Upvotes: 10

Related Questions