anwartheravian

Reputation: 1101

PySpark: Taking elements of a particular RDD partition

I'm trying to print/take elements of a particular partition. In this question, I found an elegant way to do it in Scala using this code:

distData.mapPartitionsWithIndex((index: Int, it: Iterator[Int]) => it.toList.map(x => if (index == 5) { println(x) }).iterator).collect

I'm struggling to convert this to Python; can someone help me out here?

P.S.: Also, unlike the above solution, I just want to take the first 5 elements of the partition instead of printing them all.

Upvotes: 0

Views: 1387

Answers (1)

user6022341

Reputation:

You can:

from itertools import islice

rdd.mapPartitions(lambda it: islice(it, 0, 5))  # first 5 elements of every partition

or

rdd.mapPartitionsWithIndex(lambda i, it: islice(it, 0, 5) if i == x else [])  # x is the index of the partition you want
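For context, here is a minimal end-to-end sketch of both variants, assuming a local SparkContext and a toy RDD built with parallelize; the name target stands in for the x above and is just an illustrative partition index:

from itertools import islice

from pyspark import SparkContext

sc = SparkContext("local[2]", "partition-take-example")

# Example RDD split into 4 partitions
rdd = sc.parallelize(range(20), 4)

# Take up to the first 5 elements of every partition
print(rdd.mapPartitions(lambda it: islice(it, 0, 5)).collect())

# Take up to the first 5 elements of one partition only (index 2 here)
target = 2
print(
    rdd.mapPartitionsWithIndex(
        lambda i, it: islice(it, 0, 5) if i == target else []
    ).collect()
)

sc.stop()

Because the non-matching partitions return an empty iterable, collect() only ships back the handful of elements from the partition you asked for.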

Upvotes: 1
