Reputation: 229
I'm using Spark to process large files, and I have 12 partitions.
I have rdd1 and rdd2; I join them and then select from the result (rdd3).
My problem is that the last partition is much bigger than the others: partitions 1 through 11 hold about 45,000 records each, but partition 12 holds about 9,100,000 records.
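This is roughly how I measured the partition sizes (a minimal spark-shell sketch; rdd3 is the joined-and-selected RDD from above):

```scala
// Count the records in each partition of rdd3 to see the skew.
val sizes = rdd3
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
  .sortBy(_._1)

sizes.foreach { case (idx, n) => println(s"partition $idx: $n records") }
```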
So I computed 9,100,000 / 45,000 ≈ 203 and repartitioned rdd3 into 214 partitions (203 + 11), but the last partition is still too big.
How can I balance the size of my partitions?
Should I write my own custom partitioner?
Upvotes: 0
Views: 874
Reputation: 35249
I have rdd1 and rdd2; I join them
join is the most expensive operation in Spark. To be able to join by key, you have to shuffle the values, and if the keys are not uniformly distributed, you get the behavior you describe: every record with a hot key lands in the same partition. A custom partitioner won't help you in that case, because any partitioner still has to send all records with the same key to the same partition.
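You can confirm this by counting records per key before the join. A minimal sketch, assuming rdd1 is a pair RDD keyed by the join key:

```scala
// Count records per join key and show the heaviest keys.
// A few hot keys with millions of records would explain the skew.
val keyCounts = rdd1.mapValues(_ => 1L).reduceByKey(_ + _)

keyCounts
  .sortBy(_._2, ascending = false)
  .take(10)
  .foreach { case (k, n) => println(s"key $k: $n records") }
```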
I'd consider adjusting the logic so it doesn't require a full join.
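For example, if rdd2 is small enough to fit in memory, you can replace the shuffle join with a map-side join over a broadcast variable. A sketch, not a drop-in solution (sc is the SparkContext):

```scala
// Map-side join: broadcast the small side so rdd1 is never shuffled.
// Assumes rdd2 is small enough to collect to the driver.
val smallSide = sc.broadcast(rdd2.collectAsMap())

val rdd3 = rdd1.flatMap { case (k, v) =>
  smallSide.value.get(k).map(w => (k, (v, w)))
}
```

This avoids the shuffle entirely, so the skewed key never piles up in one partition; it only works when the small side comfortably fits in driver and executor memory.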
Upvotes: 1