Reputation: 329
I want to initialize an RDD containing n pairs of zeros.
For example, for n = 3, the expected result is:
init: RDD[(Long, Long)] = ((0,0), (0,0), (0,0))
I need to create n pairs in the RDD, where n could be thousands, hundreds of thousands, or even millions. If I build the list with a Scala for loop and then transform it into an RDD, it takes a long time:
var init: List[(Long, Long)] = List((0,0))
for (a <- 1 to 1000000) {
  init = init :+ (0L, 0L) // appending to an immutable List is O(n), so this loop is quadratic
}
val pairRDD: RDD[(Long, Long)] = sc.parallelize(init)
Can anybody give me a direction on how to do this?
Upvotes: 0
Views: 743
Reputation: 215117
You can use spark.range
to initialize the RDD in parallel from the start:
val rdd = spark.range(1000000).map(_ => (0, 0)).rdd
// rdd: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[13] at rdd at <console>:23
rdd.take(5)
// res9: Array[(Int, Int)] = Array((0,0), (0,0), (0,0), (0,0), (0,0))
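Note that this gives an RDD[(Int, Int)]. A minimal variant to match the RDD[(Long, Long)] type from the question (assuming a SparkSession named spark is in scope, as in the answer above) is to emit Long literals in the map:

```scala
import org.apache.spark.rdd.RDD

// Same idea, but typed as (Long, Long) to match the question.
// spark.range generates the rows across the cluster, so nothing
// is materialized on the driver, unlike building a List first.
val rdd: RDD[(Long, Long)] = spark.range(1000000).map(_ => (0L, 0L)).rdd
```

This keeps the initialization distributed: building a million-element List on the driver and then calling sc.parallelize ships the whole collection from the driver, whereas spark.range only ships the range bounds.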
Upvotes: 4