iamorozov

Reputation: 811

Apache Spark partitions distribution strategy

There are several partitioning strategies in Apache Spark: hash partitioning, range partitioning, and the ability to write custom partitioners. But how are partitions distributed across cluster nodes? Is there a way to influence this?

Upvotes: 2

Views: 678

Answers (1)

Bartosz Konieczny

Reputation: 2033

Partition distribution in Spark depends on the data source and on your configuration. The partitioners you mention are used during manual repartitioning operations, such as coalesce or repartition. When you do that, Spark will sometimes shuffle the data between nodes (if the shuffle flag is set to true). The partitioners are also used in some RDD-based operations, for instance RDD.sortByKey, which looks like:

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length): RDD[(K, V)] = self.withScope {
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
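To see what a partitioner actually decides, here is a simplified, standalone sketch of the idea behind Spark's HashPartitioner (a non-negative modulo of the key's hashCode). The object and method names are mine for illustration; the real class lives in org.apache.spark and is more involved:

```scala
// Simplified sketch of hash-partitioning logic, NOT Spark's actual class:
// a key goes to partition (key.hashCode mod numPartitions), kept non-negative.
object HashPartitionDemo {
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0) // fix Java's negative % results
  }

  def getPartition(key: Any, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    val keys = Seq("user-1", "user-2", "user-3", "user-4")
    keys.foreach(k => println(s"$k -> partition ${getPartition(k, 4)}"))
  }
}
```

The important property is determinism: the same key always maps to the same partition index, which is what lets operations like joins co-locate equal keys.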

Regarding partitioning during data reading, it depends on the source type. For Kafka it'll be the partitions of a topic, for HDFS a file split, and for an RDBMS source a numerical column; AFAIK, the partitioners aren't involved here. Some time ago I wrote some posts about partitioning in Spark (and in Spark SQL). If you're interested, you can take a look:
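As a rough illustration of how an RDBMS source can be split on a numerical column, here is a simplified sketch: divide the range between a lower and upper bound into contiguous sub-ranges and generate one WHERE clause per partition. The names and the even-stride strategy are mine; Spark's actual JDBC reader (driven by the partitionColumn, lowerBound, upperBound and numPartitions options) handles more edge cases:

```scala
// Simplified sketch of range-splitting a numeric column for parallel
// JDBC reads. NOT Spark's exact implementation.
object JdbcSplitDemo {
  def whereClauses(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
    val stride = (upper - lower) / numPartitions
    (0 until numPartitions).map { i =>
      val start = lower + i * stride
      if (i == numPartitions - 1)
        s"$column >= $start" // last partition absorbs any remainder
      else
        s"$column >= $start AND $column < ${start + stride}"
    }
  }

  def main(args: Array[String]): Unit =
    whereClauses("id", 0L, 100L, 4).foreach(println)
}
```

Each generated predicate becomes one task's query, so the database does the filtering and each Spark partition reads a disjoint slice of rows.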

Upvotes: 3
