dimzak

Reputation: 2571

Partitioner-Multithreading for many input files in Spring Batch


I have a folder with over 1M XML files and a single-threaded step which processes each of these files in the same manner (no database connection or anything shared between the files).
Is there a way to make this step more concurrent, e.g. by partitioning on a range of filenames, or by splitting the files into different folders and using the folder names?

As I understand it, MultiResourcePartitioner cannot handle this scenario, since it:

Creates an ExecutionContext per resource, and labels them as {partition0, partition1, ..., partitionN}. The grid size is ignored.

Upvotes: 1

Views: 2220

Answers (2)

dimzak

Reputation: 2571

After some tinkering, the best result came from a custom partitioner which creates partitions based on folders. To achieve that, the previous step wrote the XML files into subfolders, 100k files per folder.
The code of the partitioner (MultiResourcePartitioner helped a lot on how to manage step executions):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class FolderPartitioner implements Partitioner {

    private static final Logger logger = LoggerFactory.getLogger(FolderPartitioner.class);

    private static final String DEFAULT_KEY_NAME = "fileName";

    private static final String PARTITION_KEY = "partition";

    private String folder;

    private String keyName = DEFAULT_KEY_NAME;

    /**
     * Create an {@link ExecutionContext} per subfolder of the configured
     * folder, storing the subfolder name under the configured key.
     */
    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> map = new HashMap<String, ExecutionContext>(
                gridSize);
        int i = 0;
        File dir = new File(folder);
        File[] chunkList = dir.listFiles();
        // Guard against a missing or unreadable folder (listFiles() returns null)
        if (chunkList == null) {
            throw new IllegalStateException("Cannot list folder: " + folder);
        }
        for (File chunkStep : chunkList) {
            if (chunkStep.isDirectory()) {

                ExecutionContext context = new ExecutionContext();
                context.putString(keyName, chunkStep.getName());
                logger.info("Creating partition for folder: {}", context.getString(keyName));
                map.put(PARTITION_KEY + i, context);
                i++;
            }
        }
        return map;
    }

    /**
     * The name of the key for the file name in each {@link ExecutionContext}.
     * Defaults to "fileName".
     * 
     * @param keyName
     *            the value of the key
     */
    public void setKeyName(String keyName) {
        this.keyName = keyName;
    }

    public String getFolder() {
        return folder;
    }

    /**
     * The folder whose subfolders are mapped to partitions, one per step.
     * 
     * @param folder the path of the parent folder
     */
    public void setFolder(String folder) {
        this.folder = folder;
    }

}
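
A minimal sketch of how such a partitioner could be wired into a job in XML config; the bean ids, the com.example package, the slave step name, and the ${input.root.folder} property are placeholders, not part of the original setup:

<bean id="folderPartitioner" class="com.example.FolderPartitioner">
    <property name="folder" value="${input.root.folder}" />
</bean>

<batch:step id="folderMasterStep">
    <batch:partition step="xmlSlaveStep" partitioner="folderPartitioner">
        <batch:handler task-executor="folderThreadPool" />
    </batch:partition>
</batch:step>

The slave step's reader can then be declared with scope="step" and resolve its input directory from #{stepExecutionContext['fileName']}, which is where the partitioner stores each subfolder name.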


The execution time went from 2 hours to 40 minutes (!!) using this partitioner.

Upvotes: 1

Karthik Prasad

Reputation: 10004

Since you already have individual files, why do you need to group them to increase concurrency? If you want more concurrency, increase the thread count in the task executor. Suppose you have 1000 files and enough memory and CPU: with a maximum of 50 threads, 50 files are processed at a time, and as soon as one file finishes the executor picks up the next, so the whole run proceeds concurrently. Here is an example.

<bean id="kpThreadPool"
    class="org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor"
    destroy-method="destroy">
    <property name="maxPoolSize" value="${app.max_thread_num}" />
</bean>

<batch:step id="kp.step1" next="kp.step2">
        <batch:partition step="kp.slave"
            partitioner="multiResourcePartitioner">
            <batch:handler task-executor="kpThreadPool" />
        </batch:partition>
</batch:step>

where app.max_thread_num=50
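
The multiResourcePartitioner bean referenced above is not shown here; a minimal definition could look like the following, assuming the input directory comes from a ${app.input_folder} property (a placeholder):

<bean id="multiResourcePartitioner"
    class="org.springframework.batch.core.partition.support.MultiResourcePartitioner">
    <!-- One partition (and hence one ExecutionContext) is created per matching file -->
    <property name="resources" value="file:${app.input_folder}/*.xml" />
</bean>

This is exactly the one-ExecutionContext-per-resource behaviour quoted in the question, which works fine at moderate scale but becomes unwieldy with 1M files.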

Upvotes: 1
