Reputation: 1101
What's the best way of finding the size of each partition in a given RDD? I'm trying to debug a skewed-partition issue, and I've tried this:
l = builder.rdd.glom().map(len).collect()  # get length of each partition
print('Min Partition Size: ', min(l), '. Max Partition Size: ', max(l), '. Avg Partition Size: ', sum(l)/len(l), '. Total Partitions: ', len(l))
It works fine for small RDDs, but for bigger ones it gives an OOM error. My guess is that glom() is causing this, since it materializes every partition as an in-memory list before the lengths are taken. Anyway, is there a better way to do it?
Upvotes: 16
Views: 20495
Reputation: 1944
If anyone came here looking for a Scala solution:
// For a DataFrame (needs import spark.implicits._ in scope for the Int encoder):
df.mapPartitions(it => Iterator(it.size)).collect()
// For the underlying RDD (it.size consumes the iterator, which is fine here):
df.rdd.mapPartitions(it => Iterator(it.size)).collect()
Upvotes: 2
Reputation: 1101
While the answer by @LostInOverflow works great, I've found another way to get the size as well as the index of each partition, using the code below. It counts each partition with a generator instead of materializing it, so it avoids the glom() memory issue. Thanks to this awesome post.
Here is the code:
l = test_join.rdd.mapPartitionsWithIndex(lambda idx, it: [(idx, sum(1 for _ in it))]).collect()  # (index, count) per partition
and then you can get the min- and max-size partitions using this code:
min(l, key=lambda item: item[1])
max(l, key=lambda item: item[1])
Knowing the index of the skewed partition, we can further debug its contents if needed, as sketched below.
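For instance, here is a minimal sketch of pulling a few rows out of the largest partition (assuming the l and test_join from above; skew_idx is just an illustrative name):
# index of the largest partition, taken from the (index, count) pairs in l
skew_idx = max(l, key=lambda item: item[1])[0]

# emit elements only from that partition, then take a small sample to inspect
sample = test_join.rdd.mapPartitionsWithIndex(
    lambda idx, it: it if idx == skew_idx else iter([])
).take(10)
Note that take() may launch more than one job here, since Spark scans partitions from the front and the earlier ones now come back empty.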
Upvotes: 12