Reputation: 43
Just take a case where I am reading data from a Database with conditions (millions of rows), doing some business on data and then updating it.
I am using a Column Range Partitioner ( Id column ) taking Min and Max Ids to process, and creating partitions with max-min/gridSize.
Now imagine i have the Ids 1, 22, 23, 24, 30 with gridSize =3 , with that logic i will have 3 partitions:
partition1 processing Id 1
partition2 processing 0 Rows
partition3 processing 22, 23, 24 and 30
With millions of Data, parallel processing like this isn't useful, and trying to recover all the data in a single request for implementing distributed partitioning takes forever..
What's the best solution?
Upvotes: 0
Views: 515
Reputation: 21463
The ColumnRangePartitioner
showed in the examples states that you need an evenly distributed column for it to be effective (as you have noted). Instead, you can typically add a row number to your query and partition on that since it will be a sequence over the results.
An example of the SQL would look something like this (for MySQL):
SELECT F.*,
@rownum := @rownum + 1 AS rank
FROM FOO F,
(SELECT @rownum := 0) r;
With that, the column rank
would be a sequence autogenerated each time you run the query. From that value, you could partition the dataset. Since this isn't persistent, you'd need to do some gymnastics to get the right ids, but the basic logic of your Partitioner
implementation would look something like this:
ItemReader
to read the items only within the range of the db ids provided.Upvotes: 1