Reputation: 67
I'll try to state the problem in a general way.
I have a function like this:
myFunction (Object first, Object second)
And I have an RDD of objects, RDD[Object].
I need to apply myFunction to the RDD's elements so that, by the end of the process, every pair of my objects has been passed to myFunction(.., ..).
One way, maybe, is to create a broadcast variable (as a copy of my RDD), and then:
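import scala.collection.mutable.ListBuffer

val broadcastVar = sc.broadcast(rdd.collect())
rdd_line.mapPartitions(p => {
  val brd = broadcastVar.value
  val result = new ListBuffer[Double]()
  // p is a single-pass Iterator, so it has to drive the outer loop;
  // iterating it inside brd.foreach would exhaust it after the first b
  p.foreach(e => {
    brd.foreach(b => result += myFunction(b, e))
  })
  result.iterator
})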
Is there another way to do this with better performance?
Upvotes: 0
Views: 62
Reputation: 18434
Use the RDD's .cartesian method to get an RDD containing all pairs of elements from two RDDs. In this case, you want the cartesian product of the RDD with itself:
rdd.cartesian(rdd).map({ case (x, y) => myFunction(x, y) })
Note that this includes pairs in both orders, i.e. (a, b) as well as (b, a), and also pairs of an element with itself, i.e. (a, a).
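If myFunction is symmetric and you only want each unordered pair once, one option is to tag every element with its position and keep only the pairs where the first index is smaller than the second. The following is a minimal sketch, not something tuned for your data: the body of myFunction, the Double element type, and the local SparkSession setup are assumptions made up for illustration, and only zipWithIndex, cartesian, filter and map are actual RDD methods.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object UniquePairs {
  // Hypothetical stand-in for the question's myFunction (assumed symmetric).
  def myFunction(first: Double, second: Double): Double = first * second

  // Applies myFunction once per unordered pair of elements at distinct positions.
  def uniquePairs(rdd: RDD[Double]): RDD[Double] = {
    // Attach a unique Long index to every element: RDD[(Double, Long)].
    val indexed = rdd.zipWithIndex()
    indexed.cartesian(indexed)
      // i < j drops (a, a) and keeps only one of (a, b) / (b, a).
      .filter { case ((_, i), (_, j)) => i < j }
      .map { case ((x, _), (y, _)) => myFunction(x, y) }
  }

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only; a real job would get its own config.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("unique-pairs")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(Seq(1.0, 2.0, 3.0))
    // Three elements give three unordered pairs: (1,2), (1,3), (2,3).
    println(uniquePairs(rdd).collect().sorted.mkString(", "))

    spark.stop()
  }
}
```

This still builds the full cartesian product before filtering, so it roughly halves the calls to myFunction rather than the shuffle cost. Filtering on i <= j instead would keep the (a, a) pairs if you need them.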
Upvotes: 2