YoYo
YoYo

Reputation: 9415

Partitioning hint for custom RDD

JDBCRDD is potentially partitioned for efficient query parallelization on the database.

Is there a way to carry over how the data is partitioned as a useful hint to the next stage, potentially a groupBy, without having to repartition the data?

Example: I am loading date/region/value. With JDBCRDD I am loading the data partitioned by Date. If I want to reduce/groupBy date and region, I should again not cause a sort and shuffle for date, and leverage the fact that the RDD is already partitioning by date.

In a pseudo API, I would do something as follows:

RDD rdd = new JDCBCRDD ...
Partitioner partitioning = (Row r)->p(r)
rdd.assertPartitioning(partitioning);
RDD<Pair<Key,Row>> rdd2 = rdd.groupWithinPartition((r)->f(r),Rowoperator::sum);

So now in theory, all my groupings are to be executed JVM instance local, same node, same JVM, same thread.

Upvotes: 3

Views: 270

Answers (2)

YoYo
YoYo

Reputation: 9415

Partitioning is controlled by the hash value of the elements in the RDD. To avoid shuffling going into the next stage, you basically need to guarantee that the same hash value is being generated. You do this by overriding the hashCode method.

Upvotes: 0

Ambling
Ambling

Reputation: 446

If what you mean is need to keep the info of the partition index with each element, I think mapWith is what you need. You can group the partition index with the data into a new class and pass to the next stage.

Upvotes: 1

Related Questions