Reputation: 3074
I need to shuffle a text file with 2.2*10^9 lines. Is there a way to load it into Spark, shuffle each partition in parallel (shuffling within the scope of a partition is enough for me), and then spill it back to a file?
Upvotes: 2
Views: 1015
Reputation: 330353
To shuffle only within partitions (each partition has to fit in memory, since shuffling requires materializing it), you can do something like this:
rdd.mapPartitions { iter =>
  val rng = new scala.util.Random()
  rng.shuffle(iter.toBuffer).iterator
}
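Since mapPartitions hands each partition to the function as an independent iterator, no element ever crosses a partition boundary. A minimal local sketch of that semantics, without Spark (partitions are simulated as plain sequences; the object and seeding scheme are illustrative, not part of any Spark API):

```scala
import scala.util.Random

object PartitionShuffle {
  // Shuffle each "partition" independently: elements never cross partition
  // boundaries, mirroring mapPartitions + Random.shuffle on an RDD.
  // A per-partition seed offset keeps the partitions' orders independent.
  def shuffleWithin[T](partitions: Seq[Seq[T]], seed: Long): Seq[Seq[T]] =
    partitions.zipWithIndex.map { case (p, i) =>
      new Random(seed + i).shuffle(p)
    }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2, 3, 4), Seq(5, 6, 7, 8))
    val shuffled = shuffleWithin(parts, seed = 42L)
    // Per-partition membership is preserved; only the order changes.
    assert(shuffled.map(_.toSet) == parts.map(_.toSet))
    println(shuffled)
  }
}
```

Because each partition shuffles on its own, the work parallelizes across tasks, which is exactly what the question asks for.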
To shuffle a whole RDD:
import org.apache.spark.HashPartitioner

rdd.mapPartitions { iter =>
  val rng = new scala.util.Random()
  iter.map((rng.nextInt, _))
}.partitionBy(new HashPartitioner(rdd.partitions.size)).values
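The whole-RDD variant works by tagging every record with a random Int key and letting a HashPartitioner bucket records by that key, which scatters them randomly across partitions. A local sketch of just that bucketing logic (the object and helper names are illustrative; this models HashPartitioner's non-negative-mod assignment, not Spark's actual shuffle I/O):

```scala
import scala.util.Random

object RandomRepartition {
  // Spark's HashPartitioner sends a key to nonNegativeMod(key.hashCode, n);
  // for an Int key, hashCode is the value itself.
  def bucket(key: Int, numPartitions: Int): Int = {
    val m = key % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Tag every record with a random key, then group records by the key's
  // bucket -- a local analogue of partitionBy(new HashPartitioner(n)).values.
  def repartition[T](data: Seq[T], numPartitions: Int, seed: Long): Vector[Seq[T]] = {
    val rng = new Random(seed)
    val keyed = data.map(v => (rng.nextInt(), v))
    val grouped = keyed.groupBy { case (k, _) => bucket(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty).map(_._2))
  }
}
```

Every input record lands in exactly one output bucket, so the result is a random redistribution of the data, not a sample.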
Upvotes: 3