Hassan Jalil

Reputation: 1194

Join on two RDDs using Scala in Spark

I am trying to implement Local Outlier Factor on Spark. I have a set of points that I read from a file, and for each point I find its N nearest neighbors. Each point is given an index using the zipWithIndex() command.

So now I have two RDDs. The first:

RDD[(Index:Long, Array[(NeighborIndex:Long, Distance:Double)])]

Here, the Long represents the point's index, and the Array consists of its N nearest neighbors, each Long being a neighbor's index and each Double that neighbor's distance from the given point.

The second:

RDD[(Index:Long,LocalReachabilityDensity:Double)]

Here, the Long again represents the index of a given point, and the Double represents its local reachability density.

What I want is an RDD that contains all the points, each paired with an array of its N closest neighbors and their local reachability densities:

RDD[(Index:Long, Array[(NeighborIndex:Long,LocalReachabilityDensity:Double)])]

So here, the Long would represent the index of a point, and the array would hold its N closest neighbors, with their index values and local reachability densities.

As I understand it, I need to run a map on the first RDD and then join the values in its array with the second RDD, which contains the local reachability densities, so that I get the local reachability density for each of the N neighbor indexes. But I am not sure how to achieve this. If anyone can help me out, that would be great.
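For concreteness, a toy version of the two inputs and the desired output might look like this (all the indices, distances, and densities here are made up, and sc is assumed to be an existing SparkContext):

// Hypothetical example data, purely for illustration.
val neighbors = sc.parallelize(Seq(
  (0L, Array((1L, 0.5), (2L, 1.2))), // point 0's two nearest neighbors
  (1L, Array((0L, 0.5), (2L, 0.9)))  // point 1's two nearest neighbors
))
val densities = sc.parallelize(Seq((0L, 1.1), (1L, 0.8), (2L, 1.4)))

// Desired result:
// (0L, Array((1L, 0.8), (2L, 1.4)))
// (1L, Array((0L, 1.1), (2L, 1.4)))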

Upvotes: 0

Views: 672

Answers (1)

Vidya

Reputation: 30300

Given:

import org.apache.spark.rdd.RDD

val rdd1: RDD[(Long, Array[(Long, Double)])] = ... // (index, Array[(neighborIndex, distance)])
val rdd2: RDD[(Long, Double)] = ...                // (index, localReachabilityDensity)

I really don't like using Scala's Array at all. I also don't like that your abstractions are at cross-purposes; in other words, the index in rdd2 is buried in various entries in rdd1. This makes things hard to reason about and also runs into a limitation of the Spark RDD API: you can't access one RDD while transforming another. I believe you should rewrite your current jobs to produce abstractions that are easier to work with.
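A minimal sketch of that limitation, using the rdd1 and rdd2 above (lookup is a PairRDDFunctions method):

// This does NOT work: an RDD cannot be used inside another RDD's
// transformation; Spark rejects nested RDD operations at runtime.
val broken = rdd1.map {
  case (index, array) => (index, rdd2.lookup(index)) // fails on the executors
}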

But if you must:

// Flip rdd1 so each neighbor's index becomes the key, carrying along
// the original point's index and the distance.
val flipped = rdd1.flatMap {
  case (index, array) =>
    array.toVector.map {
      case (neighborIndex, distance) => (neighborIndex, (index, distance))
    }
}

// Join on the neighbor's index to pick up its density, then regroup
// by the original point's index.
val result = flipped.join(rdd2)
  .map {
    case (neighborIndex, ((index, _), localReachabilityDensity)) =>
      (index, (neighborIndex, localReachabilityDensity))
  }
  .groupByKey()

The basic idea is to flip rdd1 to "extract" the neighborIndex values to the top level as the keys of the PairRDD, which then allows me to do a join with rdd2, and to replace Array with Vector. Once you do the join on the same indices, regrouping by the original point's index makes combining things much easier.
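If you really need the exact type from the question, a final (hypothetical) step can convert the grouped values back, although Iterable or Vector is usually the better abstraction:

// Optional: recover RDD[(Long, Array[(Long, Double)])] as originally asked.
val asArrays = result.mapValues(_.toArray)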

Note that this was off the top of my head and may not be perfect. The idea isn't so much to give you a solution to copy-paste as to suggest a different direction.

Upvotes: 1
