Reputation: 35
I'm trying to figure out how to join two DataFrames where one of them contains a list that holds the key to join by.
For example:
df1: (id: String, ..., ...)
df2: (..., ..., List[tmpObj])
where tmpObj is structured like this: (id: String, value: Int)
My goal is to join the two DataFrames by matching the id values found in the second DataFrame's list of tmpObj against the id from the first DataFrame.
Does anyone have any idea of how to go about this?
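For concreteness, a minimal sketch of those shapes (the names here are hypothetical placeholders, the elided extra columns are left out, and a spark-shell session with spark.implicits._ in scope is assumed):

case class TmpObj(id: String, value: Int)  // the (id, value) element type

val df1 = Seq("a", "b", "c").toDF("id")                          // first DataFrame: (id, ...)
val df2 = Seq(Seq(TmpObj("a", 1), TmpObj("c", 2))).toDF("list")  // second: (..., List[tmpObj])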
Upvotes: 3
Views: 1451
Reputation: 37822
You can use the built-in SQL function array_contains:
import org.apache.spark.sql.functions._
df1.join(df2, expr("array_contains(list.id, id)"), "left")
NOTE: you may be tempted to use org.apache.spark.sql.functions.array_contains directly (not within a call to expr, i.e. df1.join(df2, array_contains($"list.id", $"id"), "left")), but that won't do in this case: its signature limits it to cases where the second argument is a constant value (it expects Any, but fails if the argument you pass is a Column).
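Put together, a minimal sketch of this approach (assuming sample data shaped like the question's, and spark.implicits._ in scope):

import org.apache.spark.sql.functions._

case class TmpObject(id: String, value: Int)

val df1 = Seq("a", "b", "c").toDF("id")
val df2 = Seq(Seq(TmpObject("a", 1), TmpObject("c", 2))).toDF("list")

// list.id projects the array of structs down to an array of id strings,
// so array_contains can test df1's id for membership
df1.join(df2, expr("array_contains(list.id, id)"), "left").show()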
Upvotes: 2
Reputation: 22439
Use explode and join with the first dataframe on id:
import org.apache.spark.sql.functions.explode

df2.withColumn("obj", explode($"list_obj")).
  join(df1, $"obj.id" === $"id", "right")
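A usage sketch with sample data shaped like the question's (the column name list_obj is carried over from the snippet above and is an assumption, as is spark.implicits._ being in scope):

import org.apache.spark.sql.functions.explode

case class TmpObject(id: String, value: Int)

val df1 = Seq("a", "b", "c").toDF("id")
val df2 = Seq(Seq(TmpObject("a", 1), TmpObject("c", 2))).toDF("list_obj")

// explode produces one row per array element; each exploded struct's id
// is then compared to df1's top-level id. The right join keeps ids from
// df1 that have no match (their df2 columns come back null).
df2.withColumn("obj", explode($"list_obj"))
  .join(df1, $"obj.id" === $"id", "right")
  .show()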
Upvotes: 2
Reputation: 27373
You can write a UDF that checks whether the list contains the given id:
import org.apache.spark.sql.functions.udf

val df1 = Seq("a", "b", "c").toDF("id")

case class TmpObject(id: String, value: Int)

val df2 = Seq(
  Seq(
    TmpObject("a", 1),
    TmpObject("c", 2)
  )
).toDF("list")

// UDF testing whether the array of ids pulled from the structs contains df1's id
val arrayContains = udf((xs: Seq[String], s: String) => xs.contains(s))

df1.join(df2, arrayContains($"list.id", $"id"), "left")
  .show()
+---+--------------+
| id| list|
+---+--------------+
| a|[[a,1], [c,2]]|
| b| null|
| c|[[a,1], [c,2]]|
+---+--------------+
Upvotes: 1