Ben Jen

Reputation: 35

How to join two datasets when one has a column containing the collection of keys to join by?

I'm trying to figure out how to join two DataFrames where one of them contains a list column holding the keys to join by.

My goal is to join the two DataFrames by matching the id values found in the second DataFrame's list of tmpObj against the id column of the first DataFrame.
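For example, suppose the two inputs look roughly like this (a minimal sketch; the exact column and field names are just illustrative):

import spark.implicits._  // assuming an active SparkSession named spark

// df1 carries plain ids; df2 carries a list of tmpObj structs,
// each struct holding an id of its own
case class TmpObject(id: String, value: Int)

val df1 = Seq("a", "b", "c").toDF("id")
val df2 = Seq(Seq(TmpObject("a", 1), TmpObject("c", 2))).toDF("list")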

Does anyone have any idea of how to go about this?

Upvotes: 3

Views: 1451

Answers (3)

Tzach Zohar

Reputation: 37822

You can use the built-in SQL function array_contains:

import org.apache.spark.sql.functions._

// list.id projects the array of structs down to an array of their id fields
df1.join(df2, expr("array_contains(list.id, id)"), "left")

NOTE: you may be tempted to use org.apache.spark.sql.functions.array_contains directly (not within a call to expr, i.e. df1.join(df2, array_contains($"list.id", $"id"), "left")), but that won't work in this case, as its signature limits its use to cases where the second argument is a constant value (it expects Any but fails if the argument's type is Column).

Upvotes: 2

Leo C

Reputation: 22439

Use explode on the list column and join with the first DataFrame on id:

import org.apache.spark.sql.functions.explode
// one row per list element, then match each struct's nested id against df1's id
df2.withColumn("obj", explode($"list_obj")).
  join(df1, $"obj.id" === $"id", "right")
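If the one-row-per-id shape is needed afterwards, the exploded matches can be collapsed back with collect_list (a sketch, assuming the join result above is bound to a val named joined):

import org.apache.spark.sql.functions.collect_list

// hypothetical follow-up: back to one row per id, with all matching structs collected
joined.groupBy($"id").agg(collect_list($"obj").as("matched"))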

Upvotes: 2

Raphael Roth

Reputation: 27373

You can write a UDF to check whether the list contains the given id:

import spark.implicits._  // assuming an active SparkSession named spark
import org.apache.spark.sql.functions.udf

val df1 = Seq("a", "b", "c").toDF("id")

case class TmpObject(id: String, value: Int)

val df2 = Seq(
  Seq(
    TmpObject("a", 1),
    TmpObject("c", 2)
  )
).toDF("list")

// the UDF receives the array of nested ids plus df1's id for each row pair
val arrayContains = udf((xs: Seq[String], s: String) => xs.contains(s))

df1.join(df2, arrayContains($"list.id", $"id"), "left")
  .show()


+---+--------------+
| id|          list|
+---+--------------+
|  a|[[a,1], [c,2]]|
|  b|          null|
|  c|[[a,1], [c,2]]|
+---+--------------+
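Note that Catalyst cannot optimize through a Scala UDF, so the built-in array_contains from the first answer is usually preferable for plain membership; a UDF mainly pays off when the matching logic is more involved, for example a (hypothetical) case-insensitive variant:

// hypothetical variant: case-insensitive membership test
val containsIgnoreCase = udf((xs: Seq[String], s: String) => xs.exists(_.equalsIgnoreCase(s)))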

Upvotes: 1
