Reputation: 3750
I would like to parallelize several but not all steps of my spring batch application. My flow looks like this:
MainStep1: read customers table and create a list of customer config
MainStep2, per customer (if the flow for a single customer fails, do not abort the job):
innerStep1: retrieve all transactions of this customer from transactions table
innerStep2: generate a customer bill from these transactions
innerStep3: email the bill to the customer
MainStep3: aggregate results (which customers succeeded and which ones failed)
MainStep4: email results to the manager
What would be the best way to approach this? I am looking for general advice. I see several concepts, such as: multi-threaded steps, parallel steps, split flows etc.
For clarification, if there are 400 customers in the customers table, I do not want to spin up hundreds of threads in MainStep2.
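Expressed as a Spring Batch job definition, the flow above might look like the sketch below (assuming Spring Batch 5 builders; the bean and step names are mine, and each `Step` is a placeholder for the corresponding step listed above):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.job.builder.JobBuilder;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class BillingJobConfig {

    @Bean
    public Job billingJob(JobRepository jobRepository,
                          Step mainStep1,   // read customers table, build customer configs
                          Step mainStep2,   // per-customer billing (the part to parallelize)
                          Step mainStep3,   // aggregate successes and failures
                          Step mainStep4) { // email results to the manager
        return new JobBuilder("billingJob", jobRepository)
                .start(mainStep1)
                .next(mainStep2)
                .next(mainStep3)
                .next(mainStep4)
                .build();
    }
}
```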
Another approach would be to put everything into one step:
Reader: read customers table
Composite processor:
processor1: retrieve all transactions of this customer
processor2: generate a customer bill from these transactions
Writer: email the bill to the customer
Step2:
Tasklet1: aggregate results (count success and failure)
Tasklet2: email results to the manager
The problem with the latter approach is that there's a lot of logic going into each processor, and it might get overly complex. The goal is to keep parts of the flow reusable across many future jobs (e.g. how a bill is created differs from vendor to vendor, but sending a bill is the same).
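The composite-processor idea can stay manageable if each stage is a small, single-purpose piece that the composite merely chains, which is also what keeps the stages reusable. A plain-Java sketch of that chaining (all names and the stubbed data are illustrative, not from any real API; Spring's `CompositeItemProcessor` plays the same role with delegate processors):

```java
import java.util.List;
import java.util.function.Function;

// Plain-Java sketch of a composite processor: each stage is a small,
// reusable function; the composite is just their composition.
public class CompositeSketch {

    record Customer(long id) {}
    record Transaction(long customerId, double amount) {}
    record Bill(long customerId, double total) {}

    // processor1: customer -> transactions (stubbed lookup for illustration)
    static Function<Customer, List<Transaction>> fetchTransactions =
            c -> List.of(new Transaction(c.id(), 10.0),
                         new Transaction(c.id(), 5.5));

    // processor2: transactions -> bill (vendor-specific logic would live here,
    // swappable without touching the rest of the chain)
    static Function<List<Transaction>, Bill> generateBill =
            txs -> new Bill(txs.get(0).customerId(),
                            txs.stream().mapToDouble(Transaction::amount).sum());

    // The composite just chains the stages.
    static Function<Customer, Bill> composite =
            fetchTransactions.andThen(generateBill);
}
```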
Upvotes: 0
Views: 313
Reputation: 10142
This is how I would approach the problem: I would use partitioning, provided you don't partition per customer but per bulk of customers. I would also design it as a two-step job, which gives better results in case of failures and reruns.
1. First, I would group customers by some other attributes in addition to CUSTOMER_ID, aiming for a maximum of 10, 50, or 100 groups.
So <CUSTOMER_ID, CUSTOMER_ATTR1, CUSTOMER_ATTR2, ...> will be your partitioning criteria.
In other words, you achieve parallelism at the step level for a group of customers, not for each customer (one partitioned step per customer would be very time-consuming to set up).
For better performance, choose the grouping wisely so the load is distributed evenly across all partitions.
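Stripped of Spring types, the bucketing that a `Partitioner` implementation would perform in its `partition(gridSize)` method can be as simple as the sketch below (the class and method names are mine; a real implementation would return a `Map<String, ExecutionContext>` and might hash on the extra customer attributes instead of the raw ID):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of partition bucketing: fold N customers into a fixed number of
// groups, so the partition count stays constant no matter how many
// customers are in the table.
public class CustomerGrouper {

    static Map<Integer, List<Long>> group(List<Long> customerIds, int maxPartitions) {
        Map<Integer, List<Long>> buckets = new TreeMap<>();
        for (long id : customerIds) {
            int bucket = (int) (id % maxPartitions); // or hash extra attributes
            buckets.computeIfAbsent(bucket, k -> new ArrayList<>()).add(id);
        }
        return buckets;
    }
}
```

With 400 customers and `maxPartitions = 10`, you get exactly 10 groups of roughly 40 customers each, rather than 400 partitions.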
2. Your concern about not wanting to spin up hundreds of threads is valid, and point #1 already limits it by fixing the maximum number of partitions regardless of how many customers you have.
Additionally, in Spring Batch, defining partitioned steps and actually starting them are distinct concerns: execution is controlled by an async task executor and its concurrency limit, e.g.
SimpleAsyncTaskExecutor.setConcurrencyLimit
So at any point in time, at most that many partition steps / threads will be running in parallel.
You need to set your custom-defined async task executor on the partitioned step definition / configuration.
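A hypothetical wiring of this, assuming Spring Batch 5 (bean names are mine; `customerGroupPartitioner` would implement the grouping from point #1): the master step fans out one worker execution per partition, but the executor's concurrency limit caps how many run at the same time.

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

public class PartitionedStepConfig {

    @Bean
    public Step billingMasterStep(JobRepository jobRepository,
                                  Partitioner customerGroupPartitioner,
                                  Step billingWorkerStep) {
        SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor("billing-");
        executor.setConcurrencyLimit(5); // at most 5 partitions in flight at once

        return new StepBuilder("billingMasterStep", jobRepository)
                .partitioner("billingWorkerStep", customerGroupPartitioner)
                .step(billingWorkerStep)
                .taskExecutor(executor)
                .build();
    }
}
```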
3. Within step #1 (points 1 & 2), keep marking customers that have been successfully processed as PROCESSED, then in step #2 read the DB again for those records to prepare the reports you need to send.
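The second step then reduces to counting by status. A plain-Java sketch of that aggregation (the record and names are illustrative; the real step would read these statuses back from the customers table):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the step-2 aggregation: workers have marked each customer
// PROCESSED or FAILED in the DB; the report step just counts per status.
public class ResultAggregator {

    record CustomerResult(long customerId, String status) {}

    static Map<String, Long> summarize(List<CustomerResult> results) {
        return results.stream()
                .collect(Collectors.groupingBy(CustomerResult::status,
                                               Collectors.counting()));
    }
}
```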
Upvotes: 1