PNS

Reputation: 19905

Java CSV parser with string separator (multi-character)

Is there any Java open source library that supports multi-character (i.e., String with length > 1) separators (delimiters) for CSV?

By definition, CSV stands for Comma-Separated Values: data with a single character (',') as the delimiter. However, many other single-character alternatives exist (e.g., tab), so CSV effectively stands for "Character-Separated Values" (essentially DSV: Delimiter-Separated Values).

The main Java open source libraries for CSV (e.g., OpenCSV) support virtually any character as the delimiter, but not string (multi-character) delimiters. So, for data separated by strings like "|||", there is no option other than preprocessing the input to transform the string into a single-character delimiter. From then on, the data can be parsed as single-character-separated values.

It would therefore be nice if there were a library that supported string separators natively, so that no preprocessing was necessary. CSV would then stand for "CharSequence-Separated Values" data. :-)

Upvotes: 20

Views: 23331

Answers (6)

Shuai Liu

Reputation: 698

Try univocity-parsers, which supports multi-character delimiters and has the best performance.

As for commons-csv: If the input stream is a gzip stream, commons-csv 1.10.0 will incorrectly parse columns delimited by multiple characters, so use it carefully.

Upvotes: 1

Andrei Filipchyk

Reputation: 182

In 2022, openCSV version 5.7.1 still doesn't support multi-character separators.

Solution: use Apache commons-csv; version 1.9.0 supports multi-character separators!

CSVFormat.Builder.create().setDelimiter(separator);

Upvotes: 2

Niranjan Ravichandran

Reputation: 11

Workaround for the delimiter ||: add dummy fields between the needed columns.

public class ClassName {
    @CsvBindByPosition(position = 0)
    private String column1;
    @CsvBindByPosition(position = 1)
    private String dummy1;
    @CsvBindByPosition(position = 2)
    private String column2;
    @CsvBindByPosition(position = 3)
    private String dummy2;
    @CsvBindByPosition(position = 4)
    private String column3;
    @CsvBindByPosition(position = 5)
    private String dummy3;
    @CsvBindByPosition(position = 6)
    private String column4;
}
And then parse them using:
List<ClassName> responses = new CsvToBeanBuilder<ClassName>(new FileReader("test.csv"))
                .withType(ClassName.class)
                .withSkipLines(1) // to skip header
                .withSeparator('|')
                // to parse || , we use |
                .build()
                .parse();

Upvotes: 1

Peter

Reputation: 9712

None of these solutions worked for me, because they all assumed you could hold the entire CSV file in memory, allowing for easy replaceAll-style operations.

I know it's slow, but I went with Scanner. It has a surprising number of features, and makes it simple to roll your own CSV reader with any string you want as the record delimiter. It also lets you parse very large CSV files (I've done 10GB single files before), since you can read records one at a time.

Scanner s = new Scanner(inputStream, "UTF-8").useDelimiter(">|\n");
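One detail worth keeping in mind: Scanner.useDelimiter(String) compiles its argument as a regular expression, so in ">|\n" the | acts as alternation (either ">" or a newline). If a literal multi-character string such as ">|" or "|||" is meant as the record delimiter, it can be quoted with Pattern.quote. A minimal sketch, where the class name RecordScanner, the method forEachRecord, and the sample data are made up for illustration:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.function.Consumer;
import java.util.regex.Pattern;

class RecordScanner {
    // Hypothetical helper: hands each record to the handler as soon as it is read,
    // so the whole input never has to sit in memory at once.
    static void forEachRecord(Readable source, String recordDelimiter, Consumer<String> handler) {
        try (Scanner scanner = new Scanner(source)) {
            // Pattern.quote makes regex metacharacters such as '|' match literally.
            scanner.useDelimiter(Pattern.quote(recordDelimiter));
            while (scanner.hasNext()) {
                handler.accept(scanner.next());
            }
        }
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        // Sample data only; in practice the source would be a file or stream reader.
        forEachRecord(new StringReader("a,b>|c,d>|e,f"), ">|", seen::add);
        System.out.println(seen);
    }
}
```

This only separates records; quoting and field splitting inside each record still have to be handled separately.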

I would prefer a faster solution, but no library I've found supports it. FasterXML has had an open ticket to add this functionality since early 2017: https://github.com/FasterXML/jackson-dataformats-text/issues/14

Upvotes: 1

Mark O'Connor

Reputation: 78011

This is a good question. The problem was not obvious to me until I looked at the javadocs and realised that opencsv only supports a character as a separator, not a string....

Here are a couple of suggested workarounds (the examples are in Groovy, but can be converted to Java).

Ignore implicit intermediary fields

Continue to use OpenCSV, but ignore the empty fields. Obviously this is a cheat, but it will work fine for parsing well-behaved data.

    CSVParser csv = new CSVParser((char)'|')

    String[] result = csv.parseLine('J||Project report||"F, G, I"||1')

    assert result[0] == "J"
    assert result[2] == "Project report"
    assert result[4] == "F, G, I"
    assert result[6] == "1"

or

    CSVParser csv = new CSVParser((char)'|')

    String[] result = csv.parseLine('J|||Project report|||"F, G, I"|||1')

    assert result[0] == "J"
    assert result[3] == "Project report"
    assert result[6] == "F, G, I"
    assert result[9] == "1"

Roll your own

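Use Groovy's tokenize method, which is built on Java's StringTokenizer. Note that it treats every character of its argument as a delimiter, so '|||' behaves like '|' here and empty fields are skipped.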

    def result = 'J|||Project report|||"F, G, I"|||1'.tokenize('|||')

    assert result[0] == "J"
    assert result[1] == "Project report"
    assert result[2] == "\"F, G, I\""
    assert result[3] == "1"

The disadvantage of this approach is that you lose the ability to ignore quote characters or escape separators.
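If quote handling matters, a hand-rolled splitter can keep it. The following is only a sketch in plain Java under a few assumptions: the names QuotedSplitter and split are invented for illustration, a doubled quote inside a quoted field stands for a literal quote, and every record fits on one line.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedSplitter {
    // Splits one line on a multi-character delimiter, ignoring delimiters inside quotes.
    static List<String> split(String line, String delimiter, char quote) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (c == quote) {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == quote) {
                    current.append(quote); // doubled quote inside a quoted field
                    i += 2;
                } else {
                    inQuotes = !inQuotes;  // opening or closing quote, not part of the value
                    i++;
                }
            } else if (!inQuotes && line.startsWith(delimiter, i)) {
                fields.add(current.toString()); // end of field
                current.setLength(0);
                i += delimiter.length();
            } else {
                current.append(c);
                i++;
            }
        }
        fields.add(current.toString()); // last field
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("J|||Project report|||\"F, G, I\"|||1", "|||", '"'));
    }
}
```

Unlike the tokenizer version, this strips the surrounding quotes and keeps empty fields, but it does not cover quoted fields that span several lines.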

Update

Instead of pre-processing the data and altering its content, why not combine both of the above approaches in a two-step process:

  1. Use the "roll your own" to first validate the data. Split each line and prove that it contains the requisite number of fields.
  2. Use the "field ignoring" approach to parse the validated data, secure in the knowledge that the correct number of fields have been specified.

Not very efficient, but possibly easier than writing your own CSV parser :-)
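In Java, the validation step of that two-step process might look roughly like this. It is a sketch with hypothetical names (FieldCountCheck, hasFieldCount, parserIndex), and it assumes the delimiter is one character repeated (such as "||" or "|||"), since that is what makes the field-ignoring trick work.

```java
import java.util.regex.Pattern;

class FieldCountCheck {
    // Step 1: naive split on the literal delimiter; limit -1 keeps trailing empty fields.
    static boolean hasFieldCount(String line, String delimiter, int expected) {
        return line.split(Pattern.quote(delimiter), -1).length == expected;
    }

    // Step 2 helper: when a single-character parser reads data delimited by that
    // character repeated n times, logical field k ends up at array index k * n.
    static int parserIndex(int field, String delimiter) {
        return field * delimiter.length();
    }

    public static void main(String[] args) {
        String line = "J|||Project report|||\"F, G, I\"|||1";
        System.out.println(hasFieldCount(line, "|||", 4));
        System.out.println(parserIndex(2, "|||"));
    }
}
```

Because the split in step 1 ignores quoting, a delimiter that appears inside a quoted value makes the line fail validation rather than be parsed incorrectly.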

Upvotes: 5

Bohemian

Reputation: 425278

Try opencsv.

It does everything you need, including (and especially) handling embedded delimiters within quoted values (eg "a,b", "c" parses as ["a,b", "c"])

I've used it successfully and I liked it.

Edited:

Since opencsv handles only single-character separators, you could work around this thus:

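String input = readInput(); // hypothetical helper: however you obtain the raw text
char someCharNotInInput = '|';
String delimiter = "abc"; // or whatever
// Strings are immutable, so keep the result; replace() takes the delimiter literally (replaceAll() expects a regex)
input = input.replace(delimiter, String.valueOf(someCharNotInInput));
// (Reader, char) constructor as in older opencsv releases; newer ones may need CSVReaderBuilder with a CSVParserBuilder
new CSVReader(new StringReader(input), someCharNotInInput); // etc
// Put it back into each value read
value = value.replace(String.valueOf(someCharNotInInput), delimiter); // in case it was inside a quoted value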
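For inputs too large to hold in a single String, the same substitution can be done line by line while copying from a Reader to a Writer. The sketch below uses only the JDK; the class name DelimiterPreprocessor and the method rewrite are invented for illustration, and it assumes the replacement character must not already occur in the data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

class DelimiterPreprocessor {
    // Copies the input line by line, replacing the multi-character delimiter
    // with a single character that a regular CSV parser can handle.
    static void rewrite(Reader in, Writer out, String delimiter, char replacement) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.indexOf(replacement) >= 0) {
                // The substitution would be ambiguous, so stop instead of corrupting the data.
                throw new IOException("replacement character already occurs in: " + line);
            }
            out.write(line.replace(delimiter, String.valueOf(replacement)));
            out.write('\n'); // line terminators are normalised to \n
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        // Sample data only; real use would wrap file or network streams.
        rewrite(new StringReader("a|||b|||c\n1|||2|||3\n"), out, "|||", '\t');
        System.out.print(out);
    }
}
```

A delimiter inside a quoted value is replaced as well, so it still has to be put back into each value after parsing, as described above.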

Upvotes: -2
