Reputation: 1349
My Spark application processes an input CSV file in several stages. At each stage, the RDDs created are large (MBs to GBs). I have tried creating different numbers of partitions, but the partitions always complete each stage unevenly: some partitions finish a stage very quickly, while the last few always take a long time, keep throwing heartbeat timeout and executor-lost failures, and keep retrying.
I have tried changing the number of partitions to several different values, but the last few partitions never complete. I have not been able to fix this issue despite trying for a long time.
How do I handle this?
Upvotes: 0
Views: 770
Reputation: 330183
Most likely the source of the problem is this part:
aggregateByKey(Vector.empty[(Long, Int)])(_ :+ _, _ ++ _)
Basically, what you're doing is a significantly less efficient version of groupByKey. If the distribution of keys is skewed, then the distribution after the aggregation will be skewed as well, resulting in uneven load on different machines.
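One common way to mitigate that kind of key skew, assuming the per-key work can actually be expressed as an associative reduction rather than collecting every value, is to salt the keys so a hot key is spread over many partitions and only combined at the end. A minimal sketch (the function name, pair types, and salt count are illustrative, not taken from your code):

import org.apache.spark.rdd.RDD
import scala.util.Random

// Spread each key over `numSalts` artificial sub-keys, reduce per salted
// key first (many small, evenly sized tasks), then drop the salt and
// reduce again to get the final per-key result.
def saltedSum(pairs: RDD[(String, Int)], numSalts: Int = 16): RDD[(String, Int)] =
  pairs
    .map { case (k, v) => ((k, Random.nextInt(numSalts)), v) } // attach a random salt
    .reduceByKey(_ + _)                                        // partial sums per salted key
    .map { case ((k, _), partial) => (k, partial) }            // strip the salt
    .reduceByKey(_ + _)                                        // combine partials per original key

If tasks can be recomputed on failure, deriving the salt deterministically (for example from a hash of the value) is safer than Random.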
Moreover, if the data for a single key doesn't fit into main memory, the whole process will fail for the same reason it would with groupByKey.
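If what you need downstream is really an aggregate per key rather than the full list of (Long, Int) pairs, you can keep the per-key state bounded by folding each value into a small fixed-size accumulator instead of appending to a Vector. A self-contained sketch, assuming (purely for illustration) that a count and a sum of the Int component are enough:

import org.apache.spark.sql.SparkSession

object SkewFriendlyAggregation {
  def main(args: Array[String]): Unit = {
    // Local session only so the example runs standalone; not part of the question.
    val spark = SparkSession.builder.master("local[*]").appName("skew-demo").getOrCreate()
    val sc = spark.sparkContext

    // Illustrative data shaped like the question's (key, (Long, Int)) pairs.
    val pairs = sc.parallelize(Seq(
      ("a", (1L, 10)), ("a", (2L, 20)), ("a", (3L, 30)),
      ("b", (4L, 5))
    ))

    // Instead of aggregateByKey(Vector.empty[(Long, Int)])(_ :+ _, _ ++ _),
    // which materializes every value per key (a groupByKey in disguise),
    // fold each value into a (count, sum) accumulator. Per-key state stays
    // O(1), so a hot key cannot exhaust one executor's memory.
    val perKey = pairs.aggregateByKey((0L, 0L))(
      { case ((cnt, sum), (_, v)) => (cnt + 1, sum + v) }, // seqOp: fold in one value
      { case ((c1, s1), (c2, s2)) => (c1 + c2, s1 + s2) }  // combOp: merge partial results
    )

    perKey.collect().foreach(println)
    spark.stop()
  }
}

If you really do need every value per key, plain groupByKey is at least no worse than the Vector-based aggregateByKey, but the single-hot-key memory problem described above remains.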
Upvotes: 3