Randomize

Reputation: 9103

Apache Camel: parsing CSV files with multi-line values

I have a CSV file with a single record whose second value spans multiple lines:

field1,"this

is still

field2","field3"

What I would like to get, after parsing the file with Apache Camel, is JSON like this:

{"field1":"field1","field2":"this

is still

field2","field3":"field3"}

but using the following code:

from('something...')
    .transform(simple('/path/demooneline.csv', File.class))
        .unmarshal().bindy(BindyType.Csv, Demo.class)
        .marshal().json(JsonLibrary.Jackson).log('${body}')

@CsvRecord(separator = ',')
class Demo {

    @JsonView
    @DataField(pos = 1)
    private String field1

    @JsonView
    @DataField(pos = 2)
    private String field2

    @JsonView
    @DataField(pos = 3)
    private String field3

}

I am getting back:

{"field1":"field1","field2":"this","field3":null},
{"field1":"is still","field2":null,"field3":null},
{"field1":null,"field2":"field3","field3":null}

It looks like the CSV was split into 3 separate records instead of 1 record with some fields delimited by quotes. @CsvRecord already uses the double quote as its default quote character. Is there a way to parse this kind of CSV with Camel (with or without Bindy)?

Upvotes: 2

Views: 3043

Answers (1)

Ricardo Veguilla

Reputation: 3155

The problem is that your CSV file is not "typical". From Wikipedia:

"CSV" is not a single, well-defined format (although see RFC 4180 for one definition that is commonly used). Rather, in practice the term "CSV" refers to any file that:

  1. is plain text using a character set such as ASCII, Unicode, EBCDIC, or Shift JIS,
  2. consists of records (typically one record per line),
  3. with the records divided into fields separated by delimiters (typically a single reserved character such as comma, semicolon, or tab; sometimes the delimiter may include optional spaces),
  4. where every record has the same sequence of fields.

In your case, your record spans more than one line, which is why Camel is not parsing it as you expect: Camel assumes each line is a different record.

Edit

As I mentioned in the comment, it looks like Camel Bindy does not handle quoted fields containing line breaks. As a workaround, you could "preprocess" the source CSV file to replace the line breaks inside the quotes. For example, using Guava:

   from("file:///csvSrcDir?noop=true")
        .process(new Processor() {
          @Override
          public void process(Exchange exchange) throws Exception {
            final String inBody = exchange.getIn().getBody(String.class);
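            // Split on the ", sequence so that a closing quote followed by a newline
            // (a record boundary) stays inside its token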
            final Iterable<String> tokens = Splitter.on("\",").split(inBody);
            final Iterable<String> fixedTokens = FluentIterable.from(tokens).transform(new Function<String, String>() {
              @Nullable
              @Override
              public String apply(String input) {
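                // A token containing "\n ends a record, so its newline is kept;
                // any other newline sits inside a quoted field and is replaced
                // with a <br> placeholder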
                return input.contains("\"\n") ? input : input.replace("\n", "<br>");
              }
            });
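            // Re-join the tokens with the ", delimiter removed by the Splitter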
            final String outBody = Joiner.on("\",").join(fixedTokens);
            exchange.getOut().setBody(outBody);
          }
        })
        .unmarshal().bindy(BindyType.Csv, Demo.class)
        .split(body())
        .process(new Processor() {
          @Override
          public void process(Exchange exchange) throws Exception {
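            // After the split, each exchange carries one parsed Demo record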
            Demo body = exchange.getIn().getBody(Demo.class);
          }
        });

The custom Processor converts this CSV file:

"record 1 field1","this

is still

record 1 field2","record 1 field3"
"record 2 field1","this

is still

record 2 field2","record 2 field3"

into:

"record 1 field1","this<br><br>is still<br><br> record 1 field2","record 1 field3"
"record 2 field1","this<br><br>is still<br><br> record 2 field2","record 2 field3"

which Bindy can handle.
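
If you also want the JSON output to contain the original line breaks, you could reverse the substitution after Bindy has unmarshalled the records. The following is only a rough sketch along the same lines, not a tested solution: csvLineBreakProcessor is a hypothetical name for the pre-processing Processor shown above, and it assumes Demo exposes getField2()/setField2() accessors, which the question's class does not show:

    from("file:///csvSrcDir?noop=true")
        // csvLineBreakProcessor: hypothetical reference to the pre-processing Processor shown above
        .process(csvLineBreakProcessor)
        .unmarshal().bindy(BindyType.Csv, Demo.class)
        .split(body())
        .process(new Processor() {
          @Override
          public void process(Exchange exchange) throws Exception {
            Demo demo = exchange.getIn().getBody(Demo.class);
            // Turn the <br> placeholders back into real newlines so the
            // JSON value contains the original multi-line text
            demo.setField2(demo.getField2().replace("<br>", "\n"));
            exchange.getIn().setBody(demo);
          }
        })
        .marshal().json(JsonLibrary.Jackson)
        .log("${body}");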

Upvotes: 3
