Boyu Zhang

Reputation: 237

Spark: how can I see the data in each partition of an RDD

I want to test the behavior of repartition() and coalesce() myself, especially in the less common case where the number of partitions stays the same: will a call to repartition() with the same partition count still do a full shuffle of all the data? I then realized I have no way to check the exact contents of each partition. I am just using a parallelized list as my sample RDD. Is there any way to inspect the contents of each partition so I can verify my doubts? Or perhaps there is a more recent API that suits this purpose? Thanks in advance.

Upvotes: 3

Views: 1171

Answers (1)

ernest_k

Reputation: 45309

You can use RDD.glom(), which

Returns an RDD created by coalescing all elements within each partition into an array.

For example, the following 8-partition RDD can be inspected using:

// parallelize splits the data across spark.default.parallelism partitions (8 here)
val rdd = sc.parallelize(Seq(1,2,3,4,5,6,7,8,9,10))
// glom() gathers the elements of each partition into one array per partition
rdd.glom().collect()

//Result
res3: Array[Array[Int]] = Array(Array(1), Array(2), Array(3), Array(4, 5), 
                                Array(6), Array(7), Array(8), Array(9, 10))
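
As a minimal sketch of the actual test you describe (assuming a spark-shell session where sc is available; the partition count 4 and the variable names are just illustrative), you can use the same glom() trick to snapshot partition contents before and after repartition() and coalesce() with an unchanged partition count:

val rdd = sc.parallelize(1 to 10, 4)

// Snapshot every partition's contents as Array[Array[Int]]
val before = rdd.glom().collect()

// repartition(n) is coalesce(n, shuffle = true), so it shuffles even when
// n equals the current partition count; coalesce(n) with the same count
// keeps the existing partitioning
val afterRepartition = rdd.repartition(4).glom().collect()
val afterCoalesce = rdd.coalesce(4).glom().collect()

before.zipWithIndex.foreach { case (p, i) =>
  println(s"before      partition $i: ${p.mkString(",")}") }
afterRepartition.zipWithIndex.foreach { case (p, i) =>
  println(s"repartition partition $i: ${p.mkString(",")}") }
afterCoalesce.zipWithIndex.foreach { case (p, i) =>
  println(s"coalesce    partition $i: ${p.mkString(",")}") }

Comparing the three printouts should show elements moving between partitions after repartition() but staying in place after coalesce() to the same count.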

Upvotes: 6
