Reputation: 1101
What's the best way of finding the size of each partition in a given RDD? I'm trying to debug a skewed-partition issue, and I've tried this:
l = builder.rdd.glom().map(len).collect()  # get length of each partition
print('Min Partition Size: ', min(l), '. Max Partition Size: ', max(l), '. Avg Partition Size: ', sum(l)/len(l), '. Total Partitions: ', len(l))
It works fine for small RDDs, but for bigger ones it gives an OOM error. My guess is that glom() is causing this, since it materializes every partition as an in-memory list before the lengths are taken. Anyway, is there a better way to do it?
Upvotes: 16
Views: 20495
Reputation: 1944
If anyone came here looking for a Scala solution:
// For a DataFrame (needs import spark.implicits._ in scope for the Int encoder):
df.mapPartitions(it => Iterator(it.size)).collect()
// For the underlying RDD (it.size consumes the iterator, which is fine here):
df.rdd.mapPartitions(it => Iterator(it.size)).collect()
Upvotes: 2
Reputation: 1101
While the answer by @LostInOverflow works great, I've found another way to get the size as well as the index of each partition, using the code below. It counts each partition with a generator instead of materializing it, so it avoids the glom() memory issue. Thanks to this awesome post.
Here is the code:
l = test_join.rdd.mapPartitionsWithIndex(lambda idx, it: [(idx, sum(1 for _ in it))]).collect()  # (index, count) per partition
and then you can get the min- and max-size partitions using this code:
min(l, key=lambda item: item[1])
max(l, key=lambda item: item[1])
Knowing the index of the skewed partition, we can further debug its contents if needed, as sketched below.
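For instance, here is a minimal sketch of pulling a few rows out of the largest partition (assuming the l and test_join from above; skew_idx is just an illustrative name):
# index of the largest partition, taken from the (index, count) pairs in l
skew_idx = max(l, key=lambda item: item[1])[0]

# emit elements only from that partition, then take a small sample to inspect
sample = test_join.rdd.mapPartitionsWithIndex(
    lambda idx, it: it if idx == skew_idx else iter([])
).take(10)
Note that take() may launch more than one job here, since Spark scans partitions from the front and the earlier ones now come back empty.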
Upvotes: 12