Reputation: 19905
Is there any Java open source library that supports multi-character (i.e., String with length > 1) separators (delimiters) for CSV?
By definition, CSV = Comma-Separated Values data with a single character (',') as the delimiter. However, many other single-character alternatives exist (e.g., tab), so CSV effectively stands for "Character-Separated Values" data (essentially, DSV: Delimiter-Separated Values data).
Main Java open source libraries for CSV (e.g., OpenCSV) support virtually any character as the delimiter, but not string (multi-character) delimiters. So, for data separated with strings like "|||" there is no other option than preprocessing the input in order to transform the string to a single-character delimiter. From then on, the data can be parsed as single-character separated values.
It would therefore be nice if there were a library that supported string separators natively, so that no preprocessing was necessary. CSV would then stand for "CharSequence-Separated Values" data. :-)
Upvotes: 20
Views: 23331
Reputation: 698
Try univocity-parsers, which supports multi-character delimiters and has the best performance.
As for commons-csv: If the input stream is a gzip stream, commons-csv 1.10.0 will incorrectly parse columns delimited by multiple characters, so use it carefully.
Upvotes: 1
Reputation: 182
In 2022, openCSV version 5.7.1 still doesn't support multi-character separators.
Solution: use Apache commons-csv. Version 1.9.0 supports multi-character separators:
CSVFormat format = CSVFormat.Builder.create().setDelimiter(separator).build();
Upvotes: 2
Reputation: 11
Workaround for the delimiter ||: parse with the single character | as the separator and add dummy fields between the needed columns to absorb the empty values.
public class ClassName {
@CsvBindByPosition(position = 0)
private String column1;
@CsvBindByPosition(position = 1)
private String dummy1;
@CsvBindByPosition(position = 2)
private String column2;
@CsvBindByPosition(position = 3)
private String dummy2;
@CsvBindByPosition(position = 4)
private String column3;
@CsvBindByPosition(position = 5)
private String dummy3;
@CsvBindByPosition(position = 6)
private String column4;
}
And then parse them using
List<ClassName> responses = new CsvToBeanBuilder<ClassName>(new FileReader("test.csv"))
.withType(ClassName.class)
.withSkipLines(1) // to skip header
.withSeparator('|') // to parse ||, we use a single |
.build()
.parse();
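To see why the dummy fields are needed, here is a small sketch using only the JDK (no opencsv). The class and method names are made up for illustration. Splitting on a single | turns every || into an extra empty field, so the real values end up at the even positions.

```java
class DoublePipeDemo {
    // Split on a single '|'. The limit -1 keeps trailing empty fields.
    static String[] splitOnSinglePipe(String line) {
        return line.split("\\|", -1);
    }

    // Keep only the even positions (0, 2, 4, ...), i.e. drop the dummy fields.
    static String[] dropDummyFields(String[] fields) {
        String[] kept = new String[(fields.length + 1) / 2];
        for (int i = 0; i < kept.length; i++) {
            kept[i] = fields[2 * i];
        }
        return kept;
    }
}
```

Note that this only works when a single | never occurs inside a value; otherwise the positions shift.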
Upvotes: 1
Reputation: 9712
None of these solutions worked for me, because they all assumed you could store the entire CSV file in memory, allowing for easy replaceAll-type actions.
I know it's slow, but I went with Scanner. It has a surprising number of features, and it makes it simple to roll your own CSV reader with any string you want as a record delimiter. It also lets you parse very large CSV files (I've done 10GB single files before), since you can read records one at a time.
Scanner s = new Scanner(inputStream, "UTF-8").useDelimiter(">|\n");
I would prefer a faster solution, but no library I've found supports it. FasterXML has had an open ticket to add this functionality since early 2017: https://github.com/FasterXML/jackson-dataformats-text/issues/14
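A minimal sketch of how the Scanner idea could be fleshed out, assuming literal (non-regex) record and field delimiters and no quoting support. The class name, method name, and callback are hypothetical. Records are handed to the caller one at a time, so the whole input never has to sit in memory.

```java
import java.io.InputStream;
import java.util.Scanner;
import java.util.function.Consumer;
import java.util.regex.Pattern;

class ScannerRecordReader {
    // Streams records separated by recordDelimiter and splits each record
    // on fieldDelimiter. Both delimiters are treated as literal strings.
    static void forEachRecord(InputStream in, String recordDelimiter,
                              String fieldDelimiter, Consumer<String[]> handler) {
        Pattern recordPattern = Pattern.compile(Pattern.quote(recordDelimiter));
        Pattern fieldPattern = Pattern.compile(Pattern.quote(fieldDelimiter));
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            scanner.useDelimiter(recordPattern);
            while (scanner.hasNext()) {
                String record = scanner.next();
                // The limit -1 keeps trailing empty fields.
                handler.accept(fieldPattern.split(record, -1));
            }
        }
    }
}
```

Pattern.quote matters here because useDelimiter interprets its argument as a regular expression, so characters such as | would otherwise act as alternation.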
Upvotes: 1
Reputation: 78011
This is a good question. The problem was not obvious to me until I looked at the javadocs and realised that opencsv only supports a character as a separator, not a string....
Here are a couple of suggested workarounds (the examples are in Groovy and can be converted to Java).
Continue to use OpenCSV, but ignore the empty fields. Obviously this is a cheat, but it will work fine for parsing well-behaved data.
CSVParser csv = new CSVParser((char)'|')
String[] result = csv.parseLine('J||Project report||"F, G, I"||1')
assert result[0] == "J"
assert result[2] == "Project report"
assert result[4] == "F, G, I"
assert result[6] == "1"
or
CSVParser csv = new CSVParser((char)'|')
String[] result = csv.parseLine('J|||Project report|||"F, G, I"|||1')
assert result[0] == "J"
assert result[3] == "Project report"
assert result[6] == "F, G, I"
assert result[9] == "1"
Use the Java String tokenizer method.
def result = 'J|||Project report|||"F, G, I"|||1'.tokenize('|||')
assert result[0] == "J"
assert result[1] == "Project report"
assert result[2] == "\"F, G, I\""
assert result[3] == "1"
The disadvantage of this approach is that you lose the ability to ignore quote characters or escape separators.
Instead of pre-processing the data and altering its content, why not combine both of the above approaches in a two-step process:
Not very efficient, but possibly easier than writing your own CSV parser :-)
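For comparison, here is a rough sketch of what a hand-rolled, quote-aware split on a multi-character delimiter might look like in plain Java. It is not taken from any library, the class name is made up, and it assumes single-line records where a doubled quote inside a quoted field stands for a literal quote.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplitter {
    // Splits line on a literal multi-character delimiter, ignoring
    // delimiters that occur inside quoted sections. Quotes are stripped.
    static List<String> split(String line, String delimiter, char quote) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (c == quote) {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == quote) {
                    current.append(quote); // doubled quote = literal quote
                    i += 2;
                } else {
                    inQuotes = !inQuotes;  // opening or closing quote
                    i++;
                }
            } else if (!inQuotes && line.startsWith(delimiter, i)) {
                fields.add(current.toString());
                current.setLength(0);
                i += delimiter.length();
            } else {
                current.append(c);
                i++;
            }
        }
        fields.add(current.toString());
        return fields;
    }
}
```

Unlike the tokenizer approach, a call such as split(line, "|||", '"') keeps a delimiter that appears inside a quoted value as part of that value.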
Upvotes: 5
Reputation: 425278
Try opencsv.
It does everything you need, including (and especially) handling embedded delimiters within quoted values (e.g. "a,b", "c" parses as ["a,b", "c"]).
I've used it successfully and I liked it.
Since opencsv handles only single-character separators, you could work around this thus:
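String input = "x1abcx2abcx3"; // sample data; normally the CSV content
char someCharNotInInput = '|';
String delimiter = "abc"; // or whatever
// replace() treats the delimiter literally (replaceAll() would treat it as a regex) and returns a new String
String preprocessed = input.replace(delimiter, String.valueOf(someCharNotInInput));
// This constructor comes from older opencsv releases; newer ones set the separator via CSVReaderBuilder and CSVParserBuilder
CSVReader reader = new CSVReader(new StringReader(preprocessed), someCharNotInInput); // etc
// Put it back into each value read, in case the delimiter occurred inside a quoted value
value = value.replace(String.valueOf(someCharNotInInput), delimiter);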
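The workaround depends on picking a character that really does not occur in the input. A possible way to choose one automatically is sketched below, using only the JDK. The class and method names are hypothetical, and the candidate range (ASCII control characters other than tab, CR, and LF) is an assumption. Like the snippet above, it needs the whole input in memory.

```java
class DelimiterPreprocessor {
    // Returns the first ASCII control character (excluding tab, CR, LF)
    // that does not occur anywhere in the input.
    static char findUnusedChar(String input) {
        for (char c = '\u0001'; c < '\u0020'; c++) {
            if (c == '\t' || c == '\n' || c == '\r') {
                continue;
            }
            if (input.indexOf(c) < 0) {
                return c;
            }
        }
        throw new IllegalArgumentException("no unused control character found");
    }

    // Replaces every literal occurrence of the delimiter with the placeholder.
    static String toSingleChar(String input, String delimiter, char placeholder) {
        return input.replace(delimiter, String.valueOf(placeholder));
    }

    // Restores the original delimiter inside a parsed value.
    static String restore(String value, String delimiter, char placeholder) {
        return value.replace(String.valueOf(placeholder), delimiter);
    }
}
```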
Upvotes: -2