Reputation: 73
I want to generate random vectors with norm 1 in Spark.
Since the vector could be very large, I want it to be distributed. And since data in an RDD has no order, I want to store the vector as an RDD[(Int, Double)], because I also need to use this vector for some matrix-vector multiplication.
So how could I generate this kind of vector?
Here is my plan for now:
import org.apache.spark.mllib.random.RandomRDDs.normalRDD

val v = normalRDD(sc, n, NUM_NODE)
val mod = GetMod(v) // Get the norm (modulus) of v
val res = v.map(x => x / mod)
val arr: Array[Double] = res.collect() // pulls the whole vector to the driver
var tuples = List[(Int, Double)]()
for (i <- arr.indices) {
  tuples = (i, arr(i)) :: tuples
}
// Get the entries and length of the vector.
val entries = sc.parallelize(tuples)
val length = arr.length
I don't think this is elegant enough, because it goes through a "distributed -> single node -> distributed" process.
Is there a better way? Thanks :D
Upvotes: 0
Views: 1156
Reputation: 8314
You can use this function to generate a random vector, then you can normalise it by dividing each element by the vector's norm (the square root of the sum of the squared elements), or by using MLlib's Normalizer.
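For example, a minimal sketch of the Normalizer route on a local vector (using org.apache.spark.mllib.feature.Normalizer; the values here are made up for illustration):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.feature.Normalizer

// normalise a dense vector to unit L2 norm (p = 2)
val v = Vectors.dense(1.0, 2.0, 2.0)          // norm is sqrt(1 + 4 + 4) = 3
val unit = new Normalizer(2.0).transform(v)   // [1/3, 2/3, 2/3]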
Upvotes: 0
Reputation: 3084
try this:
import scala.util.Random
import scala.math.sqrt
val n = 5 // insert length of your vector here
// note: "0 until n" yields exactly n elements ("0 to n" would give n + 1)
val randomRDD = sc.parallelize(for (i <- 0 until n) yield (i, Random.nextDouble))
val norm = sqrt(randomRDD.map(x => x._2 * x._2).sum())
val finalRDD = randomRDD.mapValues(x => x / norm)
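Note that this still builds all the random values on the driver before parallelising. If the vector is too large for that, a fully distributed sketch (reusing normalRDD from the question together with zipWithIndex; n and numPartitions are placeholders you would supply) could look like:

import scala.math.sqrt
import org.apache.spark.mllib.random.RandomRDDs.normalRDD

val values = normalRDD(sc, n, numPartitions)  // RDD[Double], generated in parallel
val norm = sqrt(values.map(x => x * x).sum()) // one pass to compute the L2 norm
val unitVector = values.zipWithIndex.map { case (x, i) => (i.toInt, x / norm) }

This never collects the vector to a single node: the norm is computed with a distributed sum of squares, and zipWithIndex attaches the (Int, Double) indices in place.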
Upvotes: 1