Vikash Pareek

Reputation: 1181

How to get data from a specific partition in Spark RDD?

I want to access data from a particular partition of a Spark RDD. I can get the address of a partition as follows:

myRDD.partitions(0)

But I want to get the data from the myRDD.partitions(0) partition. I searched the official org.apache.spark documentation but couldn't find anything.

Thanks in advance.

Upvotes: 9

Views: 7829

Answers (2)

dev-blogs

Reputation: 11

The easiest way is to use the glom() function, which collects all elements of each partition into an array and returns a new RDD of these arrays, one array per partition.

Let's say we have an RDD with data spread across 5 partitions:

val rdd = sc.parallelize(1 to 20, 5)

Executing rdd.glom().collect() will print:

Array[Array[Int]] = Array(
   Array(1, 2, 3, 4), 
   Array(5, 6, 7, 8),
   Array(9, 10, 11, 12), 
   Array(13, 14, 15, 16),
   Array(17, 18, 19, 20)
)

Each inner array's position is its partition number: e.g. Array(1, 2, 3, 4) belongs to partition 0, Array(5, 6, 7, 8) to partition 1, and so on.
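Building on this, here is a minimal sketch of pulling out just one partition's data by indexing into the glom() result. Note that collect() materializes every partition on the driver, so this is only practical for small RDDs:

```scala
// Collect each partition as an array, then index the one we want.
// Caveat: glom().collect() pulls the entire RDD to the driver,
// so this only suits datasets that fit in driver memory.
val partitionZero: Array[Int] = rdd.glom().collect()(0)
// For the rdd above, partitionZero holds Array(1, 2, 3, 4)
```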

Upvotes: 1

zero323

Reputation: 330063

You can use mapPartitionsWithIndex as follows:

// Create (1, 1), (2, 2), ..., (100, 100) dataset
// and partition by key so we know what to expect
val rdd = sc.parallelize((1 to 100) map (i => (i, i)), 16)
  .partitionBy(new org.apache.spark.HashPartitioner(8))

val zeroth = rdd
  // If partition number is not zero ignore data
  .mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter else Iterator())

// Check if we get expected results 8, 16, ..., 96
assert (zeroth.keys.map(_ % 8 == 0).reduce(_ & _) & zeroth.count == 12)
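As a sketch of an alternative (assuming the same rdd as above), SparkContext.runJob accepts an explicit list of partition ids, so Spark only computes the partitions you ask for instead of filtering inside every task:

```scala
// Run a job on partition 0 only; the other partitions are never evaluated.
// runJob returns one result per requested partition, hence the .head.
val firstPartition: Array[(Int, Int)] =
  sc.runJob(rdd, (iter: Iterator[(Int, Int)]) => iter.toArray, Seq(0)).head
```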

Upvotes: 12
