Line breaks in field treated as end of line while parsing csv file

Question

In a CSV file, I have a record that renders like this:

,"SKYY SPA MARTINI

 2 oz. SKYY Vodka
 Fresh cucumber
 Fresh mint
 Splash of simple syrup

 Muddle cucumber & mint with syrup.
 Add SKYY Vodka and shake with ice. 
 Strain into a chilled martini glass. 
 Garnish with a fresh mint sprig and cucumber slice.",

with each line ending in an LF (line feed) character.

I expected the quoted field to be treated as a single string, with the line breaks inside it kept as part of the value rather than read as the end of a record. That isn't what happens, and it breaks my script. Is there a way to make the reader treat a line break as a record boundary only when it is not inside quotes? This is my current code; I couldn't find a tokenizer setting that does this.

    // instantiate description line mapper
    DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
    DefaultLineMapper lineMapper = new DefaultLineMapper<>();

    lineMapper.setLineTokenizer(lineTokenizer);
    lineMapper.setFieldSetMapper(fieldSetMapper);

    // set description line mapper
    reader.setLineMapper(lineMapper);

    return reader;

Phil · Accepted Answer

Inspired by this CSV regex post, I have written a quick-and-dirty method for doing this:

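import java.util.ArrayList;
import java.util.List;

public class CsvLineSplitter {

    public static void main(String[] args) {
        // Two records separated by \r, each with a \r inside its last quoted field
        String line = "\"BEEP\",\"BOOP\",\"TWO SHOTS\rOF VODKA\"\r\"BOOP\",\"BEEP\",\"LEMON\rWEDGES\"";

        String quote = "\"";
        String splitter = "\r";
        String delimiter = ",";

        parse(line, delimiter, quote, splitter);
    }

    public static void parse(String data, String delimiter, String quote, String splitter) {
        // Match the splitter only when an even number of quotes follows it,
        // i.e. when the splitter is not inside a quoted field
        String regex = splitter + "(?=(?:[^" + quote + "]*" + quote + "[^" + quote + "]*" + quote + ")*[^" + quote + "]*$)";

        String[] lines = data.split(regex, -1);

        List<String[]> records = new ArrayList<>();

        for (String line : lines) {
            // Plain split: a delimiter inside a quoted field is still split here
            records.add(line.split(delimiter, -1));
        }

        for (String[] line : records) {
            for (String record : line) {
                System.out.println("RECORD: " + record); // do whatever
            }
        }
    }
}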

Of course, since the parse method works on a single string, you will need to accumulate the file contents in a StringBuilder first and then call myStringBuilder.toString().split(regex, -1) in the parse method. For large CSV files, keep in mind that this holds the whole file in memory.
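
If holding the whole file in one string is a concern, the same "only break outside quotes" idea can be applied while reading line by line. The following is a rough sketch, not something from the original code: the class name QuotedRecordReader and its methods are made up for illustration, and it assumes that a quote character only ever opens or closes a field (or appears doubled inside one), so an odd number of quotes on a physical line means a quoted field is still open.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: joins physical lines into logical CSV records.
class QuotedRecordReader {

    // Counts how often the quote character occurs in one physical line.
    static int countQuotes(String text, char quote) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == quote) {
                count++;
            }
        }
        return count;
    }

    // A record ends only at a line break that is not inside an open quote.
    // readLine() accepts \n, \r and \r\n; embedded breaks are re-added as \n.
    static List<String> readRecords(Reader source, char quote) throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            if (insideQuotes) {
                current.append('\n'); // this break belongs to the field
            }
            current.append(line);
            if (countQuotes(line, quote) % 2 != 0) {
                insideQuotes = !insideQuotes; // odd count flips the state
            }
            if (!insideQuotes) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        if (insideQuotes) {
            records.add(current.toString()); // unterminated quote: keep what was read
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String data = "\"BEEP\",\"BOOP\",\"TWO SHOTS\nOF VODKA\"\n\"BOOP\",\"BEEP\",\"LEMON\nWEDGES\"";
        for (String record : readRecords(new StringReader(data), '"')) {
            System.out.println("RECORD: " + record);
        }
    }
}
```

Only one logical record is buffered at a time, so memory use depends on the longest record rather than on the file size.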

This is likely not the Spring way of doing things. But as Jim Garrison commented, this is an edge case, and I'm not sure Spring has a built-in way of handling it.

A more complex regex may be required if the fields themselves contain other special characters (commas, quotes, etc.), because the plain split on the delimiter does not check whether a comma sits inside quotes. I don't know where these records come from, but some sanitizing may be in order before splitting the file.
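
For the comma case, one option is to drop the regex for the field split and walk the record character by character. This is a minimal sketch under assumptions that the question does not confirm: the class name QuotedFieldSplitter is invented for illustration, and it assumes a literal quote inside a quoted field is written as a doubled quote (the common CSV convention).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one logical record into fields.
class QuotedFieldSplitter {

    // Splits on the delimiter only outside quotes and strips the
    // enclosing quotes. A doubled quote inside a quoted field is
    // assumed to stand for one literal quote.
    static List<String> splitFields(String record, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean insideQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (c == quote) {
                boolean doubled = insideQuotes
                        && i + 1 < record.length()
                        && record.charAt(i + 1) == quote;
                if (doubled) {
                    field.append(quote); // escaped quote, keep one copy
                    i++;                 // skip the second quote
                } else {
                    insideQuotes = !insideQuotes;
                }
            } else if (c == delimiter && !insideQuotes) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c); // includes line breaks inside quotes
            }
        }
        fields.add(field.toString()); // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        String record = "\"LEMON, LIME\",\"say \"\"hi\"\"\",plain";
        for (String field : splitFields(record, ',', '"')) {
            System.out.println("FIELD: " + field);
        }
    }
}
```

Each logical record would go through splitFields in place of line.split(delimiter, -1), so a comma or line break inside a quoted field stays part of that field.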
