Reputation: 1181
I want to access the data from a particular partition of a Spark RDD. I can get a handle on a partition as follows:
myRDD.partitions(0)
but I want to get the data from that partition.
I looked through the official org.apache.spark documentation but couldn't find anything.
Thanks in advance.
Upvotes: 9
Views: 7829
Reputation: 11
The easiest way is to use the glom()
function, which gathers all the elements of each partition into an array and returns a new RDD of these arrays, one array per partition.
Let's say we have an RDD with data spread among 5 partitions:
val rdd = sc.parallelize(1 to 20, 5)
Executing rdd.glom.collect
will print:
Array[Array[Int]] = Array(
Array(1, 2, 3, 4),
Array(5, 6, 7, 8),
Array(9, 10, 11, 12),
Array(13, 14, 15, 16),
Array(17, 18, 19, 20)
)
Each inner array's position is its partition number.
E.g., Array(1, 2, 3, 4)
belongs to the zeroth partition, Array(5, 6, 7, 8)
to the first partition, and so on.
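To get the data of a single partition with this approach, you can index into the glommed result. A minimal sketch (note that collect() pulls every partition to the driver first, so this only suits small RDDs):

```scala
// Sketch: fetch one partition's data via glom.
// Caution: collect() materializes ALL partitions on the driver,
// so this is only appropriate for small RDDs.
val rdd = sc.parallelize(1 to 20, 5)
val partitionZero: Array[Int] = rdd.glom().collect()(0)
// partitionZero == Array(1, 2, 3, 4)
```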
Upvotes: 1
Reputation: 330063
You can use mapPartitionsWithIndex
as follows:
// Create (1, 1), (2, 2), ..., (100, 100) dataset
// and partition by key so we know what to expect
val rdd = sc.parallelize((1 to 100) map (i => (i, i)), 16)
.partitionBy(new org.apache.spark.HashPartitioner(8))
val zeroth = rdd
// If partition number is not zero ignore data
.mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter else Iterator())
// Check that we get the expected keys: 8, 16, ..., 96
assert(zeroth.keys.map(_ % 8 == 0).reduce(_ && _) && zeroth.count == 12)
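If you only need one partition's contents on the driver, a lighter-weight alternative (my suggestion, not part of the answer above) is SparkContext.runJob, which runs a job on just the partitions you name instead of filtering inside a full pass:

```scala
// Sketch: fetch only partition 0, without touching the other partitions.
// runJob takes the RDD, a function applied to each partition's iterator,
// and the sequence of partition indices to run on; it returns one result
// per requested partition.
val firstPartition: Array[(Int, Int)] =
  sc.runJob(rdd, (iter: Iterator[(Int, Int)]) => iter.toArray, Seq(0)).head
```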
Upvotes: 12