Reputation: 719
Can I create a Spark RDD (not a PairRDD) with a custom Partitioner? I don't seem to find anything in the API that would allow that... The partitionBy method only works on PairRDDs.
Upvotes: 0
Views: 742
Reputation: 354
I am afraid you cannot; that is how the API has been designed. You need to attach a key to each record so that you can say which partition it should be sent to.
Unless a letter has a postal code on it, the postman can't decide where it needs to be delivered.
If your RDD does not naturally have keys, you can create them programmatically using the APIs below, and then call partitionBy on the resulting pair RDD (see the sketch after the list):
zipWithIndex()
zipWithUniqueId()
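As a minimal Scala sketch of that idea (assuming an existing SparkContext named sc; the data and partition count are arbitrary), you can generate keys with zipWithIndex and then apply a partitioner via partitionBy:

```scala
import org.apache.spark.HashPartitioner

// Assumes `sc` is an existing SparkContext.
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)

// zipWithIndex yields (element, index); swap the pair so the generated
// index becomes the key that the partitioner will see.
val keyed = rdd.zipWithIndex.map { case (value, idx) => (idx, value) }

// partitionBy is now available because we have a pair RDD.
val repartitioned = keyed.partitionBy(new HashPartitioner(4))
```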
Upvotes: 1
Reputation: 2033
AFAIK you can't, and my understanding of why is the following:
When Apache Spark reads data, it treats it as a kind of black box*. So the framework cannot say, "Oh, here I have a line X, so I have to put it into partition 1" at the very first step, when it has no idea what is inside. Instead, the framework uses a number of parameters, such as the number of partitions, the split size and so on, to figure out how much data should be read from the given source in each task (the parameters depend on the source). So the idea is to assign smaller parts of a big dataset to tasks (partitions) rather than analyzing each line/row/record and deciding where it should land. Even for natively partitioned data sources like Apache Kafka, Spark works that way, i.e. without interpreting the data for the partitioning. IMO that's one of the main differences between a distributed data processing framework and a distributed data store, where you can sometimes define your own partitioning logic, but only because you're receiving specific records instead of a "bag" of data. In other words, Spark's partitioning is tied to the data source's partitioning logic in order to exploit the parallelism of the source for the initial read.
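To make that concrete, here is a small sketch (the path is hypothetical; assumes an existing SparkContext sc) showing that at read time you only influence how the input is split, via parameters such as minPartitions, never where an individual line goes:

```scala
// minPartitions only hints how the file is split into tasks; Spark does
// not inspect the lines themselves to decide which partition they land in.
val lines = sc.textFile("hdfs:///data/events.log", minPartitions = 8)
println(lines.getNumPartitions)
println(lines.partitioner) // None - no content-based partitioner at read time
```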
Another point is that an explicit partitionBy also expresses your intent. By doing it you're saying that the pipeline needs all data for a specific key in the same partition, so that you can run aggregations or other grouping operations on it.
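As a sketch of that intent (the key choice and partitioner name below are hypothetical; assumes an existing SparkContext sc), a business-logic partitioner is declared on a pair RDD and tells Spark to co-locate all records sharing a key:

```scala
import org.apache.spark.Partitioner

// Hypothetical partitioner: send each country code to a fixed partition.
class CountryPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val h = key.asInstanceOf[String].hashCode % numPartitions
    if (h < 0) h + numPartitions else h
  }
}

// Made-up data; all "FR" records end up in the same partition,
// ready for grouping or aggregation.
val byCountry = sc.parallelize(Seq(("FR", 1), ("PL", 2), ("FR", 3)))
  .partitionBy(new CountryPartitioner(2))
```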
Also, if you take a look at org.apache.spark.rdd.RDD#partitioner, you'll see that it's involved mostly in operations that require a shuffle - i.e. something the user asked for. It's not used to distribute the data read at the very beginning of the computation.
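A quick way to observe this (assuming an existing SparkContext sc) is to inspect the partitioner field before and after a shuffle operation:

```scala
val flat = sc.parallelize(1 to 10)
println(flat.partitioner)    // None: nothing is decided at creation/read time

val reduced = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).reduceByKey(_ + _)
println(reduced.partitioner) // Some(HashPartitioner) set by the shuffle the user asked for
```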
So to sum up and clarify a little, I would distinguish 2 categories of partitioning. The first concerns data sources, and there you need to play with the configuration properties exposed by the framework. The second is the business-logic partitioner, applied after converting a flat RDD into a pair RDD; there the operation is considered a grouping operation, since it expresses the intent to have all similar data in the same partition in order to do something with it (aggregates, session generation, ...).
* - not always. For instance, when you're using JDBC with Spark SQL, you can define a column used for partitioning, which is applied as a kind of range partitioning by key. But that's possible only thanks to the organization of the storage (structured data).
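For reference, a minimal Spark SQL sketch of that JDBC case (the URL, table name and bounds are hypothetical; assumes an existing SparkSession named spark) - the partitionColumn options drive a range-style split of the read:

```scala
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/shop")
  .option("dbtable", "orders")
  .option("partitionColumn", "order_id") // column used to range-partition the read
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "10")
  .load()
```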
Upvotes: 1