Ben Jen

Reputation: 35

How to join two datasets when one has a column containing the collection of keys to join by?

I'm trying to figure out how to join two DataFrames where one of them contains a list column holding the keys to join by.

My goal is to join the two DataFrames by matching the id values found in the second DataFrame's list of tmpObj against the id column of the first DataFrame.
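For example, suppose the two inputs look roughly like this (a minimal sketch; the exact column and field names are just illustrative):

import spark.implicits._  // assuming an active SparkSession named spark

// df1 carries plain ids; df2 carries a list of tmpObj structs,
// each struct holding an id of its own
case class TmpObject(id: String, value: Int)

val df1 = Seq("a", "b", "c").toDF("id")
val df2 = Seq(Seq(TmpObject("a", 1), TmpObject("c", 2))).toDF("list")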

Does anyone have any idea of how to go about this?

Upvotes: 3

Views: 1451

Answers (3)

Tzach Zohar

Reputation: 37822

You can use the built-in SQL function array_contains:

import org.apache.spark.sql.functions._

// list.id projects the array of structs down to an array of their id fields
df1.join(df2, expr("array_contains(list.id, id)"), "left")

NOTE: you may be tempted to use org.apache.spark.sql.functions.array_contains directly (not within a call to expr, i.e. df1.join(df2, array_contains($"list.id", $"id"), "left")), but that won't work in this case, as its signature limits its use to cases where the second argument is a constant value (it expects Any but fails if the argument's type is Column).

Upvotes: 2

Leo C

Reputation: 22439

Use explode on the list column and join with the first DataFrame on id:

import org.apache.spark.sql.functions.explode
// one row per list element, then match each struct's nested id against df1's id
df2.withColumn("obj", explode($"list_obj")).
  join(df1, $"obj.id" === $"id", "right")
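If the one-row-per-id shape is needed afterwards, the exploded matches can be collapsed back with collect_list (a sketch, assuming the join result above is bound to a val named joined):

import org.apache.spark.sql.functions.collect_list

// hypothetical follow-up: back to one row per id, with all matching structs collected
joined.groupBy($"id").agg(collect_list($"obj").as("matched"))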

Upvotes: 2

Raphael Roth

Reputation: 27373

You can write a UDF to check whether the list contains the given id:

import spark.implicits._  // assuming an active SparkSession named spark
import org.apache.spark.sql.functions.udf

val df1 = Seq("a", "b", "c").toDF("id")

case class TmpObject(id: String, value: Int)

val df2 = Seq(
  Seq(
    TmpObject("a", 1),
    TmpObject("c", 2)
  )
).toDF("list")

// the UDF receives the array of nested ids plus df1's id for each row pair
val arrayContains = udf((xs: Seq[String], s: String) => xs.contains(s))

df1.join(df2, arrayContains($"list.id", $"id"), "left")
  .show()


+---+--------------+
| id|          list|
+---+--------------+
|  a|[[a,1], [c,2]]|
|  b|          null|
|  c|[[a,1], [c,2]]|
+---+--------------+
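Note that Catalyst cannot optimize through a Scala UDF, so the built-in array_contains from the first answer is usually preferable for plain membership; a UDF mainly pays off when the matching logic is more involved, for example a (hypothetical) case-insensitive variant:

// hypothetical variant: case-insensitive membership test
val containsIgnoreCase = udf((xs: Seq[String], s: String) => xs.exists(_.equalsIgnoreCase(s)))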

Upvotes: 1
