DilTeam

Reputation: 2661

Spark: Joining Dataframes

I need to join two DataFrames as follows:

  1. If a key exists on both sides, take the values from the right.
  2. If a key exists only on the left, take the values from the left.
  3. If a key exists only on the right, take the values from the right.

What's the best way to do this with DataFrames? I previously did this with the RDD `cogroup` method, which is not available on DataFrames.

Upvotes: 3

Views: 9454

Answers (1)

zero323

Reputation: 330423

You can simply perform a full outer join and use `coalesce`:

import org.apache.spark.sql.functions.coalesce

val dfLeft = sc.parallelize(Seq((2, "2L"), (3, "3L"))).toDF("kl", "vl")
dfLeft.show

## +---+---+
## | kl| vl|
## +---+---+
## |  2| 2L|
## |  3| 3L|
## +---+---+

val dfRight = sc.parallelize(Seq((1, "1R"), (3, "3R"))).toDF("kr", "vr")
dfRight.show

## +---+---+
## | kr| vr|
## +---+---+
## |  1| 1R|
## |  3| 3R|
## +---+---+

dfLeft
  .join(dfRight, $"kl" === $"kr", "fullouter")
  .select(coalesce($"kl", $"kr").alias("k"), coalesce($"vr", $"vl").alias("v"))
  .show

## +---+---+
## |  k|  v|
## +---+---+
## |  1| 1R|
## |  2| 2L|
## |  3| 3R|
## +---+---+
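The join-plus-`coalesce` pattern above is a right-biased merge: matching keys take the right value, and unmatched keys on either side keep their own. As a sanity check of that logic (no Spark involved, and `mergeRightBiased` is just an illustrative name), the same semantics fall out of Scala's standard `Map ++`:

```scala
// Right-biased merge of two key-value maps:
// entries from `right` overwrite matching keys in `left`,
// while keys unique to either side are kept as-is.
def mergeRightBiased[K, V](left: Map[K, V], right: Map[K, V]): Map[K, V] =
  left ++ right

val l = Map(2 -> "2L", 3 -> "3L") // like dfLeft
val r = Map(1 -> "1R", 3 -> "3R") // like dfRight

mergeRightBiased(l, r)
// Map(1 -> "1R", 2 -> "2L", 3 -> "3R") -- same rows as the join result
```

This only works when each key appears at most once per side; the full outer join handles the general case (and keeps everything distributed).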

Upvotes: 11
