Reputation: 137
I have
Dataset A: uuid, listOfLocationsIds, name
Dataset B: locationId, latitude, longitude
A.listOfLocationIds can have multiple locationIds
How can I do a join on A and B with each value in listOfLocationsIds? So if there are two values in listOfLocationIds, I would want the join to consider each locationId in the listOfLocationIds
A.join(B, A.listOfLocationsIds[0] == B.locationId, "left")
A.join(B, A.listOfLocationsIds[1] == B.locationId, "left")
Upvotes: 0
Views: 355
Reputation: 1601
Assume dataset A is called df
with this content:
+----+-----------------+-----+
|uuid|listOfLocationsId|name |
+----+-----------------+-----+
|1 |[1, 2, 3] |name1|
|2 |[1, 3] |name1|
+----+-----------------+-----+
and dataset B is called df2
with this content:
+----------+--------+---------+
|locationId|latitude|longitude|
+----------+--------+---------+
|2 |5 |7 |
+----------+--------+---------+
And we do an array_contains
join:
df = df.join(df2,
array_contains(col("listOfLocationsId"), col("locationId")), "left"
)
The final result:
+----+-----------------+-----+----------+--------+---------+
|uuid|listOfLocationsId|name |locationId|latitude|longitude|
+----+-----------------+-----+----------+--------+---------+
|1 |[1, 2, 3] |name1|2 |5 |7 |
|2 |[1, 3] |name1|null |null |null |
+----+-----------------+-----+----------+--------+---------+
Good luck!
Upvotes: 1