Scott
Scott

Reputation: 137

spark join with column multiple values in list

I have

Dataset A: uuid, listOfLocationsIds, name

Dataset B: locationId, latitude, longitude

A.listOfLocationIds can have multiple locationIds

How can I do a join on A and B with each value in listOfLocationsIds? So if there are two values in listOfLocationIds, I would want the join to consider each locationId in the listOfLocationIds

A.join(B, A.listOfLocationsIds[0] == B.locationId, "left")
A.join(B, A.listOfLocationsIds[1] == B.locationId, "left")

Upvotes: 0

Views: 355

Answers (1)

vilalabinot
vilalabinot

Reputation: 1601

Assume dataset A is called df with this content:

+----+-----------------+-----+
|uuid|listOfLocationsId|name |
+----+-----------------+-----+
|1   |[1, 2, 3]        |name1|
|2   |[1, 3]           |name1|
+----+-----------------+-----+

and dataset B is called df2 with this content:

+----------+--------+---------+
|locationId|latitude|longitude|
+----------+--------+---------+
|2         |5       |7        |
+----------+--------+---------+

And we do an array_contains join:

df = df.join(df2, 
  array_contains(col("listOfLocationsId"), col("locationId")), "left"
)

The final result:

+----+-----------------+-----+----------+--------+---------+
|uuid|listOfLocationsId|name |locationId|latitude|longitude|
+----+-----------------+-----+----------+--------+---------+
|1   |[1, 2, 3]        |name1|2         |5       |7        |
|2   |[1, 3]           |name1|null      |null    |null     |
+----+-----------------+-----+----------+--------+---------+

Good luck!

Upvotes: 1

Related Questions