Hlib

Reputation: 3074

Is there a way to shuffle a collection in Spark?

I need to shuffle a text file with 2.2*10^9 lines. Is there a way I can load it in Spark, shuffle each partition in parallel (shuffling within the scope of a partition is enough for me), and then write it back out to a file?
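
For reference, a rough sketch of the surrounding load/write steps (the paths are placeholders and sc is assumed to be a SparkContext, e.g. from spark-shell); the per-partition shuffle in the middle is the missing piece:

val lines = sc.textFile("/path/to/input.txt")   // placeholder input path
val shuffled = lines                            // <- the per-partition shuffle would go here
shuffled.saveAsTextFile("/path/to/output")      // placeholder output path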

Upvotes: 2

Views: 1015

Answers (1)

zero323

Reputation: 330353

To shuffle only within partitions you can do something like this:

// Materialize each partition, shuffle it in memory, and return a fresh iterator
rdd.mapPartitions(iter => new scala.util.Random().shuffle(iter.toSeq).iterator)
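
As a quick sanity check, a toy example (again assuming a SparkContext sc, e.g. in spark-shell) shows that each partition keeps exactly its own elements, only in a different order:

val toy = sc.parallelize(1 to 20, 4)
val shuffledWithin = toy.mapPartitions(iter => new scala.util.Random().shuffle(iter.toSeq).iterator)
// glom() gathers each partition into an array so the per-partition order can be inspected
shuffledWithin.glom().collect().foreach(p => println(p.mkString(", ")))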

To shuffle a whole RDD:

import org.apache.spark.HashPartitioner

rdd.mapPartitions(iter => {
  // Pair every record with a random key
  val rng = new scala.util.Random()
  iter.map((rng.nextInt, _))
}).partitionBy(new HashPartitioner(rdd.partitions.size)).values // repartition by the random key, then drop it
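
Tying this back to the question, an end-to-end sketch that reads the text file, shuffles it globally, and writes it back out could look roughly like this (paths are placeholders, sc is an assumed SparkContext):

import org.apache.spark.HashPartitioner

val lines = sc.textFile("/path/to/input.txt")       // placeholder path
val shuffled = lines.mapPartitions(iter => {
  val rng = new scala.util.Random()
  iter.map((rng.nextInt, _))                        // tag each line with a random key
}).partitionBy(new HashPartitioner(lines.partitions.size)).values
shuffled.saveAsTextFile("/path/to/output")          // placeholder path

The random keys spread the lines across partitions via the HashPartitioner, and .values drops the temporary keys afterwards.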

Upvotes: 3
