Raphael Roth

Reputation: 27373

parse csv, do not split within single OR double quotes

I am trying to parse a CSV file with Java and have the following issue: the second column is a String (which may also contain commas) enclosed in double quotes, unless the string itself contains a double quote, in which case the entire string is enclosed in single quotes.

Lines may look like this:

someStuff,"hello", someStuff
someStuff,"hello, SO", someStuff
someStuff,'say "hello, world"', someStuff
someStuff,'say "hello, world', someStuff

someStuff are placeholders for other elements, which can also include quotes in the same style

I'm looking for a generic way to split the lines at commas UNLESS they are enclosed in single OR double quotes, in order to get the second column as a String. By second column I mean the quoted field in each of the lines above.

I tried OpenCSV, but it fails because one can only specify one type of quote:

import java.io.FileReader;
import java.io.IOException;

import com.opencsv.CSVReader; // au.com.bytecode.opencsv.CSVReader in opencsv 2.x

public class CSVDemo {

    public static void main(String[] args) throws IOException {
        CSVDemo demo = new CSVDemo();
        demo.process("input.csv");
    }

    public void process(String fileName) throws IOException {
        // Resolve the file from the classpath
        String file = this.getClass().getClassLoader().getResource(fileName)
                .getFile();
        CSVReader reader = new CSVReader(new FileReader(file));
        String[] nextLine;
        while ((nextLine = reader.readNext()) != null) {
            System.out.println(nextLine[0] + " | " + nextLine[1] + " | "
                    + nextLine[2]);
        }
    }

}

The solution with opencsv fails on the last line where there is only one double quote enclosed in single quotes:

someStuff | hello |  someStuff
someStuff | hello, SO |  someStuff
someStuff | 'say "hello, world"' |  someStuff
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1

Upvotes: 4

Views: 3524

Answers (5)

Menelaos

Reputation: 25727

It doesn't seem opencsv supports this. However, have a look at this previous question and my answer, as well as the other answers, in case they help you: https://stackoverflow.com/a/15905916/1688441

Below is an example; please note that notInsideComma actually means "not inside quotes". The following code could be extended to check for both single and double quotes.

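// requires: import java.util.ArrayList;
public static ArrayList<String> customSplitSpecific(String s)
{
    ArrayList<String> words = new ArrayList<String>();
    boolean notInsideComma = true; // true while we are outside double quotes
    int start = 0;
    // Scan every character, including the last one, so a trailing comma is not missed
    for(int i=0; i<s.length(); i++)
    {
        if(s.charAt(i)==',' && notInsideComma)
        {
            words.add(s.substring(start,i));
            start = i+1;
        }
        else if(s.charAt(i)=='"')
        {
            notInsideComma=!notInsideComma;
        }
    }
    // Whatever follows the last separator is the final field
    words.add(s.substring(start));
    return words;
}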
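As a rough sketch of the extension mentioned above, the boolean flag can be replaced by a variable that remembers which quote character opened the current field, so that a double quote inside single quotes (or the reverse) is ignored. The class and method names here are hypothetical, and the sketch assumes that quotes are never escaped inside a field.

```java
import java.util.ArrayList;
import java.util.List;

final class QuoteAwareSplitter {

    // Splits a line at commas that are not inside '...' or "...".
    // The quote characters are kept in the returned fields.
    static List<String> splitOutsideQuotes(String s) {
        List<String> fields = new ArrayList<>();
        char openQuote = 0; // 0 = outside quotes, otherwise the quote char that opened the field
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (openQuote == 0) {
                if (c == ',') {
                    fields.add(s.substring(start, i));
                    start = i + 1;
                } else if (c == '"' || c == '\'') {
                    openQuote = c; // only the same character can close it again
                }
            } else if (c == openQuote) {
                openQuote = 0;
            }
        }
        fields.add(s.substring(start)); // last field
        return fields;
    }
}
```

Note that an apostrophe in an unquoted field (as in "don't") would be treated as an opening quote by this sketch, so it only fits input that follows the format described in the question.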

Upvotes: 0

Raphael Roth

Reputation: 27373

If the use of single and double quotes is consistent per line, one could choose the corresponding type of quote per line:

public class CSVDemo {
    public static void main(String[] args) throws IOException {
        CSVDemo demo = new CSVDemo();
        demo.process("input.csv");
    }

    public void process(String fileName) throws IOException {
        String file = this.getClass().getClassLoader().getResource(fileName)
                .getFile();

        CSVParser doubleParser = new CSVParser(',', '"');
        CSVParser singleParser = new CSVParser(',', '\'');

        String[] nextLine;

        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.contains(",'") && line.contains("',")) {
                    nextLine = singleParser.parseLine(line);
                } else {
                    nextLine = doubleParser.parseLine(line);
                }

                System.out.println(nextLine[0] + " | " + nextLine[1] + " | "
                        + nextLine[2]);
            }
        }
    }
}

Upvotes: 0

Andrea Ligios

Reputation: 50203

Basically you only need to track ," and ,' (trimming what's in the middle).

When you encounter one of those, set the appropriate flag (eg. singleQuoteOpen, doubleQuoteOpen) to true to indicate they're open and you are in ignore-commas mode.

When you meet the appropriate closing quote, reset the flag and keep slicing the elements.

To perform the check, stop at every comma (when not in ignore-commas mode) and look at the next char (if any, and trimming).
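A minimal sketch of this idea, with hypothetical class and method names: after every separator it skips whitespace and looks at the next character, and only a quote found at that position opens a quoted field. Instead of two boolean flags, this variant jumps straight to the matching closing quote with indexOf, which has the same effect for the format in the question. It assumes quotes are never escaped inside a field.

```java
import java.util.ArrayList;
import java.util.List;

final class FieldStartQuoteParser {

    // Returns the fields of one line, trimmed and without surrounding quotes.
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        int n = line.length();
        int i = 0;
        while (i <= n) {
            // Skip whitespace between the separator and the field
            while (i < n && Character.isWhitespace(line.charAt(i))) {
                i++;
            }
            char first = i < n ? line.charAt(i) : 0;
            if (first == '"' || first == '\'') {
                // Quoted field: commas are ignored until the matching quote
                int close = line.indexOf(first, i + 1);
                if (close < 0) {
                    close = n; // unterminated quote: take the rest of the line
                }
                fields.add(line.substring(i + 1, close));
                int comma = line.indexOf(',', close);
                i = comma < 0 ? n + 1 : comma + 1;
            } else {
                // Unquoted field: runs up to the next comma or the end of the line
                int comma = line.indexOf(',', i);
                int end = comma < 0 ? n : comma;
                fields.add(line.substring(i, end).trim());
                i = end + 1;
            }
        }
        return fields;
    }
}
```

Because a quote is only special at the start of a field, an apostrophe in the middle of an unquoted field does not switch on the ignore-commas mode.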


Note: the regex solution is good and also shorter, but less customizable for edge cases (at least without big headaches).

Upvotes: 1

OldCurmudgeon

Reputation: 65793

If you truly cannot use a real CSV parser, you could use a regex. This is generally not a good idea, as there are always edge cases that you cannot handle, but if the formatting is strictly as you describe, this may work.

// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
public void test() {
    String[] tests = {"numeStuff,\"hello\", someStuff, someStuff",
        "numeStuff,\"hello, SO\", someStuff, someStuff",
        "numeStuff,'say \"hello, world\"', someStuff, someStuff"
    };
    /* Matches a field followed by a separator or the end of the input.
     *
     *  ( - Field Group
     *     \"  - Start with a quote
     *     [^\"]*? - Non-greedy match on anything that is not a quote
     *     \" - End with a quote
     *   | - Or
     *     '  - Start with a strop (single quote)
     *     [^']*? - Non-greedy match on anything that is not a strop
     *     ' - End with a strop
     *   | - Or
     *    [^\"'] - Not starting with a quote or strop
     *    [^,\r\n]*? - Non-greedy match on anything that is not a comma or line break
     *  ) - End field group
     *  ( - Separator group
     *   [,\r\n]|$ - Comma, line break, or end of input
     *  ) - End separator group
     *
     * Limitation: empty fields (two consecutive commas) are not handled.
     */
    Pattern p = Pattern.compile("(\"[^\"]*?\"|'[^\']*?\'|[^\"'][^,\r\n]*?)([,\r\n]|$)");
    for (String t : tests) {
        System.out.println("Matching: " + t);
        Matcher m = p.matcher(t);
        while (m.find()) {
            // group(1) is the field, still including its quotes
            System.out.println(m.group(1));
        }
    }
}
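To get from the printed matches to the second column as a plain String, the matched fields could be collected into a list and stripped of their surrounding quotes. The sketch below reuses the pattern from the method above unchanged; the class name and the helpers fields and unquote are hypothetical, and the same caveats about edge cases apply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexFieldExtractor {

    // Same pattern as in the answer: quoted field, or unquoted field, then a separator
    private static final Pattern FIELD =
            Pattern.compile("(\"[^\"]*?\"|'[^']*?'|[^\"'][^,\r\n]*?)([,\r\n]|$)");

    // Collects every matched field of a line, trimmed and unquoted
    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            result.add(unquote(m.group(1).trim()));
        }
        return result;
    }

    // Removes one pair of matching quotes around the field, if present
    static String unquote(String field) {
        if (field.length() >= 2) {
            char first = field.charAt(0);
            char last = field.charAt(field.length() - 1);
            if ((first == '"' || first == '\'') && first == last) {
                return field.substring(1, field.length() - 1);
            }
        }
        return field;
    }
}
```

With this, fields(line).get(1) would give the second column, including for the line from the question that contains a single double quote inside single quotes.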

Upvotes: 2

user2033671

Reputation:

It does not appear that opencsv supports this out of the box. You could extend com.opencsv.CSVParser and implement your own algorithm for handling two types of quotes. This is the source of the method you would be changing and here is a stub to get you started.

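class MyCSVParser extends CSVParser {
    // parseLine(String, boolean) is private in some opencsv versions and cannot be
    // overridden there, so this stub overrides the public parseLine(String) instead.
    @Override
    public String[] parseLine(String nextLine) throws IOException {
        // Your algorithm here
        throw new UnsupportedOperationException("not implemented yet");
    }
}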

Upvotes: 1
