Reputation: 21
I have an rdd which I repartition by one field
rdd = rdd.repartition(new Column("block_id"));
and save it to HDFS.
I would expect that if there are 20 different block_ids, the repartitioning would produce 20 new partitions, each holding a different block_id.
But in fact after repartitioning there are 19 partitions, each holding exactly one block_id, and one partition holding two block_ids.
This means that the core writing the partition with the two block_ids to disk takes twice as long as the other cores, which doubles the overall write time.
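One way to see the skew is to count distinct block_ids per physical partition; a minimal sketch in Scala (df stands in for the DataFrame above):

import org.apache.spark.sql.functions.{col, countDistinct, spark_partition_id}

val repartitioned = df.repartition(col("block_id"))
repartitioned
  .groupBy(spark_partition_id().as("partition"))            // physical partition id
  .agg(countDistinct("block_id").as("distinct_block_ids"))  // keys per partition
  .orderBy("partition")
  .show()

Empty partitions produce no rows, so only the non-empty ones appear; a collision shows up as a partition with distinct_block_ids = 2.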
Upvotes: 1
Views: 1893
Reputation: 35229
Spark Dataset uses hash partitioning. There is no guarantee that there will be no hash collisions, so you cannot expect:

that if there are 20 different block_ids, the repartitioning would produce 20 new partitions, each holding a different block_id
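To see why 20 distinct keys need not spread over 20 buckets, here is a small Spark-free sketch; it uses String.hashCode as a stand-in for Spark's Murmur3-based hash, but the pigeonhole effect is the same:

val blockIds = (1 to 20).map(i => s"block_$i")
val numPartitions = 20

blockIds
  .groupBy(id => Math.floorMod(id.hashCode, numPartitions)) // bucket = hash mod partition count
  .toSeq.sortBy(_._1)
  .foreach { case (bucket, ids) => println(s"bucket $bucket -> ${ids.mkString(", ")}") }

Whenever two ids land in the same bucket, their rows end up in the same partition and some other partition stays empty.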
You can try to increase the number of partitions, but choosing a number that offers good guarantees is rather impractical.
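For completeness, the partition count can be passed explicitly via the two-argument repartition overload; a sketch (the count and output path are illustrative):

import org.apache.spark.sql.functions.col

df.repartition(100, col("block_id"))  // more buckets -> fewer collisions, still no guarantee
  .write
  .parquet("hdfs:///path/to/output")  // illustrative output path

This lowers the collision probability, but as noted it cannot guarantee one block_id per partition.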
With RDDs you can design your own partitioner: How to Define Custom partitioner for Spark RDDs of equally sized partition where each partition has equal number of elements?
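A minimal sketch of such a partitioner, assuming a pair RDD keyed by block_id (pairRdd and the String key type are assumptions):

import org.apache.spark.Partitioner

// Assigns each known block_id its own partition -- no collisions by construction.
class ExactPartitioner(keys: Seq[String]) extends Partitioner {
  private val index = keys.zipWithIndex.toMap
  override def numPartitions: Int = keys.size
  override def getPartition(key: Any): Int = index(key.asInstanceOf[String])
}

val blockIds = pairRdd.keys.distinct().collect().toSeq
val balanced = pairRdd.partitionBy(new ExactPartitioner(blockIds))

The extra distinct/collect pass is the price for the guarantee: the partitioner has to know every key up front.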
Upvotes: 1