Astrum

Reputation: 621

Reading CSV data in Spring Batch (creating a custom LineMapper)

I've been doing a bit of work writing some batch processing code on CSV data. I found a tutorial online and so far have been using it without really understanding how or why it works, which means I'm unable to solve a problem I'm currently facing.

The code I'm working with is below:

    @Bean
    public LineMapper<Employee> lineMapper() {
        DefaultLineMapper<Employee> lineMapper = new DefaultLineMapper<>();
        DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
        lineTokenizer.setNames(new String[] { "id", "firstName", "lastName" });
        lineTokenizer.setIncludedFields(new int[] { 0, 1, 2 });
        BeanWrapperFieldSetMapper<Employee> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
        fieldSetMapper.setTargetType(Employee.class);
        lineMapper.setLineTokenizer(lineTokenizer);
        lineMapper.setFieldSetMapper(fieldSetMapper);
        return lineMapper;
    }

I'm not entirely clear on what setNames and setIncludedFields are really doing. I've looked through the docs, but I still don't know what's happening under the hood. Why do we need to give names to the lineTokenizer? Why can't it just be told how many columns of data there will be? Is its only purpose to let the fieldSetMapper know which columns map to which fields of the data object? And do the names all need to match the field names in the POJO?

I now have a new problem: I need to process CSVs with a large number of columns (about 25-35). Is there a way to generate the columns in setNames programmatically from the variable names of the POJOs, rather than typing them in by hand?

Edit:

An example input file may be something like:

test.csv:
field1, field2, field3,
a,b,c
d,e,f
g,h,j

The DTO:

public class Test {

    private String field1;
    private String field2;
    private String field3;

    // setters, getters and constructor
}

Upvotes: 4

Views: 4807

Answers (1)

Mahmoud Ben Hassine

Reputation: 31600

I see the confusion, so I will try to clarify how the key interfaces work together. A LineMapper is responsible for mapping a single line from your input file to an instance of your domain type. The default implementation provided by Spring Batch is the DefaultLineMapper, which delegates the work to two collaborators:

  • LineTokenizer: which takes a String and tokenizes it into a FieldSet (which is similar to the ResultSet in the JDBC world, where you can get fields by index or name)
  • FieldSetMapper: which maps the FieldSet to an instance of your domain type

So the process is: String -> FieldSet -> Object.

Each interface comes with a default implementation, but you can provide your own if needed.

DelimitedLineTokenizer

The names attribute in DelimitedLineTokenizer is used to create named fields in the FieldSet. This allows you to get a field by name from the FieldSet (again, similar to the ResultSet methods that look up a column by name). The includedFields attribute lets you select a subset of the columns in your input file, which helps in a case like yours, where the file has 25 or more columns and you may only need some of them.
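To make the roles of names and includedFields more concrete, here is a minimal plain-Java analogy. It is not Spring Batch's implementation, and the class and method names (NamedTokenizerSketch, tokenize) are made up for illustration. It assumes a comma delimiter, no quoting, and one name per included column.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java analogy only; this is not how Spring Batch is implemented.
class NamedTokenizerSketch {

    // Splits a line on commas, keeps only the columns listed in includedFields,
    // and pairs each kept column with the name at the same position in names.
    static Map<String, String> tokenize(String line, String[] names, int[] includedFields) {
        if (names.length != includedFields.length) {
            throw new IllegalArgumentException("need exactly one name per included column");
        }
        String[] tokens = line.split(",", -1);
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < includedFields.length; i++) {
            fields.put(names[i], tokens[includedFields[i]].trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Column 1 is skipped, so only two names are needed.
        Map<String, String> fields = tokenize("1,Ada,Lovelace,unused",
                new String[] { "id", "lastName" }, new int[] { 0, 2 });
        System.out.println(fields); // {id=1, lastName=Lovelace}
    }
}
```

The point of the sketch is that the names are what later lets a mapper ask for "lastName" instead of "column 2", which a bare column count could not provide.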

BeanWrapperFieldSetMapper

This FieldSetMapper implementation expects a type and uses the JavaBean naming conventions for getters/setters to set fields on the target object from the FieldSet.
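As a rough illustration of what "uses the JavaBean naming conventions" means, the sketch below derives a setter name from each field name and calls it through reflection. It is a simplified stand-in rather than the mapper's real code: the names SetterMappingSketch and populate are hypothetical, the Employee class is a cut-down version of the one in the question, and it assumes every property is a String with a public setter.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

class SetterMappingSketch {

    // Hypothetical bean, standing in for the Employee class from the question.
    public static class Employee {
        private String firstName;
        private String lastName;
        public String getFirstName() { return firstName; }
        public void setFirstName(String firstName) { this.firstName = firstName; }
        public String getLastName() { return lastName; }
        public void setLastName(String lastName) { this.lastName = lastName; }
    }

    // For each named value, calls the setter derived from the name,
    // e.g. "firstName" -> setFirstName(String).
    static <T> T populate(Class<T> type, Map<String, String> fields) {
        try {
            T target = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, String> field : fields.entrySet()) {
                String name = field.getKey();
                String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
                Method method = type.getMethod(setter, String.class);
                method.invoke(target, field.getValue());
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not populate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("firstName", "Ada");
        fields.put("lastName", "Lovelace");
        Employee employee = populate(Employee.class, fields);
        System.out.println(employee.getFirstName() + " " + employee.getLastName()); // Ada Lovelace
    }
}
```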

Is there a way to generate the columns in setNames programmatically with the variable names of the POJOs, rather than editing them in by hand?

This is the part that the BeanWrapperFieldSetMapper handles. If the FieldSet carries field names, the mapper calls the setter of the property with the matching name on the target object. The name matching is fuzzy in the sense that it tolerates close matches. Here is an excerpt from the Javadoc:

Property name matching is "fuzzy" in the sense that it tolerates close matches,
as long as the match is unique. For instance:

* Quantity = quantity (field names can be capitalised)
* ISIN = isin (acronyms can be lower case bean property names, as per Java Beans recommendations)
* DuckPate = duckPate (capitalisation including camel casing)
* ITEM_ID = itemId (capitalisation and replacing word boundary with underscore)
* ORDER.CUSTOMER_ID = order.customerId (nested paths are recursively checked)
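To get a feel for the first four examples in that excerpt, here is a deliberately simplified comparison that ignores case and underscores. This is a swapped-in technique chosen for brevity, not the algorithm the mapper actually uses, and it does not handle nested paths. The names FuzzyNameSketch, normalize and matches are hypothetical.

```java
import java.util.Locale;

// Simplified illustration only; the real mapper's matching logic is different.
class FuzzyNameSketch {

    // Lower-cases the name and drops underscores, so "ITEM_ID" and "itemId" compare equal.
    static String normalize(String name) {
        return name.replace("_", "").toLowerCase(Locale.ROOT);
    }

    // True when the column name and the bean property name agree after normalization.
    static boolean matches(String columnName, String propertyName) {
        return normalize(columnName).equals(normalize(propertyName));
    }

    public static void main(String[] args) {
        System.out.println(matches("ITEM_ID", "itemId"));    // true
        System.out.println(matches("DuckPate", "duckPate")); // true
        System.out.println(matches("ISIN", "quantity"));     // false
    }
}
```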

This mapper is also configurable with a custom ConversionService if needed. If this still does not cover your use case, you need to provide a custom mapper.
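If the array passed to setNames should itself be derived from the POJO, plain Java reflection is one option. The sketch below is an assumption-laden illustration, not something Spring Batch provides: PojoFieldNames, namesOf and SampleDto are hypothetical names, and SampleDto stands in for the DTO from the question. Note that the Javadoc of Class.getDeclaredFields does not guarantee any particular order, while the names must line up with the column order of the file, so the result should be checked against the CSV header before relying on it.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Arrays;

class PojoFieldNames {

    // Hypothetical stand-in for the DTO from the question.
    static class SampleDto {
        private String field1;
        private String field2;
        private String field3;
    }

    // Collects the names of the non-static, non-synthetic fields declared on the class.
    // The order of the returned names is not guaranteed by the reflection API.
    static String[] namesOf(Class<?> type) {
        return Arrays.stream(type.getDeclaredFields())
                .filter(f -> !Modifier.isStatic(f.getModifiers()) && !f.isSynthetic())
                .map(Field::getName)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(namesOf(SampleDto.class)));
    }
}
```

The resulting array could then be handed to the tokenizer, for example lineTokenizer.setNames(PojoFieldNames.namesOf(Employee.class)), provided the field order matches the column order in the file.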

Upvotes: 4
