Reputation: 3074
I need to shuffle a text file with 2.2*10^9 lines. Is there a way to load it into Spark, shuffle each partition in parallel (shuffling within the scope of a partition is enough for me), and then spill it back to a file?
Upvotes: 2
Views: 1015
Reputation: 330353
To shuffle only within partitions (each partition has to fit in memory, since shuffling requires materializing it), you can do something like this:
rdd.mapPartitions { iter =>
  val rng = new scala.util.Random()
  rng.shuffle(iter.toBuffer).iterator
}
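Since mapPartitions hands each partition to the function as an independent iterator, no element ever crosses a partition boundary. A minimal local sketch of that semantics, without Spark (partitions are simulated as plain sequences; the object and seeding scheme are illustrative, not part of any Spark API):

```scala
import scala.util.Random

object PartitionShuffle {
  // Shuffle each "partition" independently: elements never cross partition
  // boundaries, mirroring mapPartitions + Random.shuffle on an RDD.
  // A per-partition seed offset keeps the partitions' orders independent.
  def shuffleWithin[T](partitions: Seq[Seq[T]], seed: Long): Seq[Seq[T]] =
    partitions.zipWithIndex.map { case (p, i) =>
      new Random(seed + i).shuffle(p)
    }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2, 3, 4), Seq(5, 6, 7, 8))
    val shuffled = shuffleWithin(parts, seed = 42L)
    // Per-partition membership is preserved; only the order changes.
    assert(shuffled.map(_.toSet) == parts.map(_.toSet))
    println(shuffled)
  }
}
```

Because each partition shuffles on its own, the work parallelizes across tasks, which is exactly what the question asks for.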
To shuffle a whole RDD:
import org.apache.spark.HashPartitioner

rdd.mapPartitions { iter =>
  val rng = new scala.util.Random()
  iter.map((rng.nextInt, _))
}.partitionBy(new HashPartitioner(rdd.partitions.size)).values
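The whole-RDD variant works by tagging every record with a random Int key and letting a HashPartitioner bucket records by that key, which scatters them randomly across partitions. A local sketch of just that bucketing logic (the object and helper names are illustrative; this models HashPartitioner's non-negative-mod assignment, not Spark's actual shuffle I/O):

```scala
import scala.util.Random

object RandomRepartition {
  // Spark's HashPartitioner sends a key to nonNegativeMod(key.hashCode, n);
  // for an Int key, hashCode is the value itself.
  def bucket(key: Int, numPartitions: Int): Int = {
    val m = key % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Tag every record with a random key, then group records by the key's
  // bucket -- a local analogue of partitionBy(new HashPartitioner(n)).values.
  def repartition[T](data: Seq[T], numPartitions: Int, seed: Long): Vector[Seq[T]] = {
    val rng = new Random(seed)
    val keyed = data.map(v => (rng.nextInt(), v))
    val grouped = keyed.groupBy { case (k, _) => bucket(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty).map(_._2))
  }
}
```

Every input record lands in exactly one output bucket, so the result is a random redistribution of the data, not a sample.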
Upvotes: 3