Reputation: 33
I am trying to match user logins with the closest city in an efficient manner.
Starting with two RDDs of the following form:
I would like to join these two into the following format, based on the closest city as calculated by the haversine function.
In plain Scala I would do this with a double for loop, but that is not allowed in Spark. I have tried using cartesian (`rdd1.cartesian(rdd2)`) and then reducing, but this gives me a massive N*M matrix.
Is there a faster, more space-efficient way of joining these RDDs based on the shortest haversine distance?
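For reference, the haversine distance mentioned above can be sketched as follows (a plain-Python illustration; the function name and the Earth-radius constant are my own choices, not from the question):

```python
from math import radians, sin, cos, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    R = 6371.0  # mean Earth radius in km (assumed constant)
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))
```

The same formula translates directly to Scala for use inside a Spark `map`.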
Upvotes: 3
Views: 1020
Reputation: 330393
One way to approach this is to avoid the join completely. Assuming that #cities << #users (in other words, RDD1.count << RDD2.count), the most efficient approach is to simply map over the users:

- Collect RDD1 to a local data structure, broadcast it, and use it for mapping.
- If RDD1 is too large to be stored in memory but small enough to be passed around as a single file, you can easily adjust this approach by replacing the local data structure with a solution like SpatiaLite (distributing the file with SparkFiles).
- Finally, if none of the above works for you, be smart about the way you join:
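The broadcast-and-map idea can be sketched like this (pure Python standing in for Spark, with all data made up for illustration; in actual Spark you would wrap the city list with `sc.broadcast(...)` and call `rdd2.map(closest_city)`):

```python
from math import radians, sin, cos, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers."""
    R = 6371.0
    a = (sin(radians(lat2 - lat1) / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2))
         * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * R * asin(sqrt(a))

# Small city table (RDD1) collected to the driver; in Spark:
#   cities_b = sc.broadcast(cities)  and read cities_b.value in the closure.
cities = [("paris", 48.8566, 2.3522), ("london", 51.5074, -0.1278)]

def closest_city(user):
    """Map one user record to (user_id, nearest city name)."""
    user_id, lat, lon = user
    name, _, _ = min(cities, key=lambda c: haversine(lat, lon, c[1], c[2]))
    return (user_id, name)

# The map over users (RDD2); in Spark: rdd2.map(closest_city)
users = [("u1", 48.85, 2.36), ("u2", 51.51, -0.13)]
matched = [closest_city(u) for u in users]
# matched == [("u1", "paris"), ("u2", "london")]
```

This costs O(#users * #cities) distance computations but never materializes the N*M cartesian product, and only the small city table is shipped to each executor.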
Upvotes: 1