Reputation: 9415
JDBCRDD
is potentially partitioned for efficient query parallelization on the database.
Is there a way to carry over how the data is partitioned as a useful hint to the next stage, potentially a groupBy
, without having to repartition the data?
Example: I am loading date/region/value. With JDBCRDD
I am loading the data partitioned by Date. If I want to reduce/groupBy date and region, I should again not cause a sort and shuffle for date, and leverage the fact that the RDD is already partitioning by date.
In a pseudo API, I would do something as follows:
RDD rdd = new JDCBCRDD ...
Partitioner partitioning = (Row r)->p(r)
rdd.assertPartitioning(partitioning);
RDD<Pair<Key,Row>> rdd2 = rdd.groupWithinPartition((r)->f(r),Rowoperator::sum);
So now in theory, all my groupings are to be executed JVM instance local, same node, same JVM, same thread.
Upvotes: 3
Views: 270