Reputation: 761
This is a tough one, but I am sure it is not unheard of.
I have two datasets, Countries and Demographics. The Countries dataset contains the name of a country and an ID pointing to its demographic data.
The Demographics dataset is hierarchical, running from the country down to the suburb.
Both of these datasets are pulled from a 3rd party on a weekly basis.
I need to split the demographics out into files, one for each country.
So far the steps I have are:
1. Pull Countries
2. Pull Demographics
3. (this is the part I need) Loop over the Countries dataset, calling a "Write Country Demographics to File" step for each country.

Is it possible to somehow repeat a step, passing in the current country ID?
EDIT: Added link to sample of PartitionHandler
Thanks, JBristow. The link below shows overriding the PartitionHandler to pass parameters using the addArgument method of a JavaTask object, but it looks like a lot of heavy lifting for the developer and not very "business problem specific", which is the goal of Spring Batch. http://www.activeeon.com/blog/all/integration/distribute-a-spring-batch-job-on-the-proactive-scheduler
I also saw, in your original link, section 7.4.3 (Binding Input Data to Steps), which is in the context of 7.4.2 (Partitioner). This looks very exciting:
<bean id="itemReader" scope="step"
class="org.spr...MultiResourceItemReader">
<property name="resource" value="#{stepExecutionContext[fileName]}/*"/>
</bean>
I don't suppose that anyone has some sample XML config of this in play?
Thanks in advance.
Upvotes: 3
Views: 11099
Reputation: 1725
Yes, check out the partitioning feature of spring-batch! http://static.springsource.org/spring-batch/reference/html-single/index.html#partitioning
Basically, it allows you to use a "partitioner" to create new execution contexts to pass to a handler that then does something with that information.
While partitioning was made for parallelization, its default concurrency is 1, so you can start small and ratchet it up to match the hardware at your disposal (see the sketch after the XML below). Since I assume that each country's data is not dependent on the others (at least in the download-demographics step), your job could make use of basic parallelization.
/EDIT: Adding example.
Here's what I do (more or less): First, the XML:
<beans>
    <batch:job id="jobName">
        <batch:step id="innerStep.master">
            <batch:partition partitioner="myPartitioner" step="innerStep"/>
        </batch:step>
    </batch:job>

    <bean id="myPartitioner" class="org.lapseda.MyPartitioner" scope="step">
        <property name="jdbcTemplate" ref="jdbcTemplate"/>
        <property name="runDate" value="#{jobExecutionContext['runDate']}"/>
        <property name="recurrenceId" value="D"/>
    </bean>

    <!-- the step each partition runs; its id must match step="innerStep" in the partition above -->
    <batch:step id="innerStep">
        <batch:tasklet>
            <batch:chunk reader="someReader" processor="someProcessor" writer="someWriter" commit-interval="10"/>
        </batch:tasklet>
    </batch:step>
</beans>
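If you want the partitions to actually run in parallel rather than one at a time, you hand the partition a task executor. A minimal sketch of what that change would look like (the taskExecutor bean and the grid-size of 4 are my own additions, not part of the job above):

<batch:step id="innerStep.master">
    <batch:partition partitioner="myPartitioner" step="innerStep">
        <!-- grid-size is only a hint passed to partition(int gridSize); the executor is what adds concurrency -->
        <batch:handler task-executor="taskExecutor" grid-size="4"/>
    </batch:partition>
</batch:step>

<bean id="taskExecutor" class="org.springframework.core.task.SimpleAsyncTaskExecutor"/>

SimpleAsyncTaskExecutor spawns a new thread per partition; swap in a pooled executor if you need to bound the thread count.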
And now some Java:
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class MyPartitioner implements Partitioner {

    // fields and setters for jdbcTemplate, runDate and recurrenceId (wired in the XML above) omitted for brevity

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        List<String> list = getValuesToRunOver(); // e.g. query the ids to partition on via jdbcTemplate (method omitted here)
        // I use TreeMap because my partitions are ordered; HashMap should work if order isn't important
        Map<String, ExecutionContext> out = new TreeMap<String, ExecutionContext>();
        for (String item : list) {
            ExecutionContext context = new ExecutionContext();
            context.put("key", "value"); // add your own stuff!
            out.put("innerStep" + item, context);
        }
        return out;
    }
}
Then, inside your step, you just read from that context the same way you would read from a normal step or job context.
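For instance (my own sketch, not from the job above; "key" is whatever name the partitioner put into the ExecutionContext), a step-scoped bean can late-bind against the partition's context:

<bean id="someReader" scope="step"
      class="org.springframework.batch.item.file.FlatFileItemReader">
    <!-- resolved once per partition, against the context built in partition() -->
    <property name="resource" value="file:#{stepExecutionContext['key']}"/>
    <property name="lineMapper">
        <bean class="org.springframework.batch.item.file.mapping.PassThroughLineMapper"/>
    </property>
</bean>

The scope="step" is what makes the #{stepExecutionContext[...]} expression work; without it the bean would be created before any step execution exists to bind against.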
Upvotes: 12