Spark repartition does not distribute records evenly

Question

I have an rdd which I re-partition by one field

   rdd = rdd.repartition( new Column("block_id"));

and save it to hdfs.

I would expect that if there are 20 different block_id's, the repartitioning would produce 20 new partitions each holding a different block_id. But in fact after repartitioning there are 19 partitions, each holding exactly one block_id and one partition holding two block_id's. This means that the core writing the partition with the two block_id's to disk takes twice the time compared to the other cores and therefore doubling the overall time.

Spark repartition does not distribute records evenly

Answers (1)

Related Questions