Non Distributed Data _ Spring Batch Partitioner

Question

Just take a case where I am reading data from a Database with conditions (millions of rows), doing some business on data and then updating it.

I am using a Column Range Partitioner ( Id column ) taking Min and Max Ids to process, and creating partitions with max-min/gridSize.

Now imagine i have the Ids 1, 22, 23, 24, 30 with gridSize =3 , with that logic i will have 3 partitions:

partition1 processing Id 1

partition2 processing 0 Rows

partition3 processing 22, 23, 24 and 30

With millions of Data, parallel processing like this isn't useful, and trying to recover all the data in a single request for implementing distributed partitioning takes forever..

What's the best solution?

Michael Minella · Accepted Answer

The ColumnRangePartitioner showed in the examples states that you need an evenly distributed column for it to be effective (as you have noted). Instead, you can typically add a row number to your query and partition on that since it will be a sequence over the results.

An example of the SQL would look something like this (for MySQL):

SELECT F.*, 
       @rownum := @rownum + 1 AS rank 
FROM FOO F, 
     (SELECT @rownum := 0) r;

With that, the column rank would be a sequence autogenerated each time you run the query. From that value, you could partition the dataset. Since this isn't persistent, you'd need to do some gymnastics to get the right ids, but the basic logic of your Partitioner implementation would look something like this:

Run count query to find out how many records there are in your data set.
Run a query using above technique to figure out what the db id is for the start and the end of each partition range. This will give you the ids to filter by per partition.
Create a partition for each pair (start/end) using the actual db ids.
Setup your ItemReader to read the items only within the range of the db ids provided.

Non Distributed Data _ Spring Batch Partitioner

Answers (1)

Related Questions