KyBe

Reputation: 842

Perform an operation of one element over the rest of the RDD

I'm new to Spark and I really enjoy the possibilities offered by this technology. My problem is how to perform an operation with one element over the rest of the RDD, for each element, without using a for loop. Here's my attempt with a for loop:

 // RDD[(Key: Int, Vector: (Double, Double))]
 val rdd = data.map(x => (x.split(',')(0).toInt, Vectors.dense(x.split(',')(1).toDouble, x.split(',')(2).toDouble)))

 for (ind <- 0 to rdd.count().toInt - 1) {
   val element1 = rdd.filter(x => x._1 == ind)
   val vector1 = element1.first()._2
   val rdd2 = rdd.map(x => {
     val dist1 = Vectors.sqdist(x._2, vector1)
     (x._1, Math.sqrt(dist1))
   })
 }

Thank you for your advice.

Upvotes: 2

Views: 259

Answers (1)

Shyamendra Solanki

Reputation: 8851

If you are looking to find the distances between all pairs of vectors, use `rdd.cartesian`:

import org.apache.spark.mllib.linalg.Vectors

val rdd = sc.parallelize(Array("0,1.0,1.0","1,2.0,2.0","2,3.0,3.0"))
val r = rdd.map(x => x.split(","))
           .map(y =>(y(0).toInt, Vectors.dense(y(1).toDouble, y(2).toDouble)))

val res = r.cartesian(r).map { case (first, second) =>
   ((first._1, second._1),
    Math.sqrt(Vectors.sqdist(first._2, second._2)))
}

However, this computes the distance between each pair of vectors twice (first for (A,B), then for (B,A)).
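One way to avoid the duplicate work (a sketch, reusing the same `r` RDD from above; `resOnce` is just an illustrative name) is to keep only the pairs whose first key is strictly less than the second:

```scala
import org.apache.spark.mllib.linalg.Vectors

// Keep each unordered pair exactly once by requiring key1 < key2;
// this also drops the trivial zero-distance (A,A) pairs.
val resOnce = r.cartesian(r)
  .filter { case (first, second) => first._1 < second._1 }
  .map { case (first, second) =>
    ((first._1, second._1),
     Math.sqrt(Vectors.sqdist(first._2, second._2)))
  }
```

Note that `cartesian` still materializes all n² pairs before the filter runs; the filter only halves the downstream distance computations, not the shuffle itself.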

Upvotes: 2
