Reputation: 27373
I am trying to parse a CSV file with Java and have the following issue: the second column is a String (which may also contain commas) enclosed in double quotes, unless the string itself contains a double quote, in which case the entire string is enclosed in single quotes.
Lines may look like this:
someStuff,"hello", someStuff
someStuff,"hello, SO", someStuff
someStuff,'say "hello, world"', someStuff
someStuff,'say "hello, world', someStuff
someStuff is a placeholder for other elements, which can also include quotes in the same style.
I'm looking for a generic way to split the lines at commas UNLESS they are enclosed in single OR double quotes, in order to get the second column as a String. By second column I mean the quoted field in each of the sample lines above.
I tried OpenCSV but failed, as one can only specify one type of quote:
public class CSVDemo {
    public static void main(String[] args) throws IOException {
        CSVDemo demo = new CSVDemo();
        demo.process("input.csv");
    }

    public void process(String fileName) throws IOException {
        String file = this.getClass().getClassLoader().getResource(fileName)
                .getFile();
        CSVReader reader = new CSVReader(new FileReader(file));
        String[] nextLine;
        while ((nextLine = reader.readNext()) != null) {
            System.out.println(nextLine[0] + " | " + nextLine[1] + " | "
                    + nextLine[2]);
        }
    }
}
The solution with opencsv fails on the last line where there is only one double quote enclosed in single quotes:
someStuff | hello | someStuff
someStuff | hello, SO | someStuff
someStuff | 'say "hello, world"' | someStuff
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
Upvotes: 4
Views: 3524
Reputation: 25727
It doesn't seem that opencsv supports this. However, have a look at this previous question and my answer, as well as the other answers, in case they help you: https://stackoverflow.com/a/15905916/1688441
Below is an example. Please note that the flag notInsideComma is misnamed: it actually means "not inside quotes". The following code could be extended to check for both single and double quotes.
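// Requires: import java.util.ArrayList;
public static ArrayList<String> customSplitSpecific(String s)
{
    ArrayList<String> words = new ArrayList<String>();
    boolean notInsideComma = true; // true while we are outside double quotes
    int start = 0;
    // Scan every character, including the last, so a trailing comma still ends a field.
    for (int i = 0; i < s.length(); i++)
    {
        if (s.charAt(i) == ',' && notInsideComma)
        {
            words.add(s.substring(start, i));
            start = i + 1;
        }
        else if (s.charAt(i) == '"')
            notInsideComma = !notInsideComma;
    }
    words.add(s.substring(start)); // the last field has no trailing comma
    return words;
}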
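One way the extension to both quote types might look is sketched below. This is only an illustration, not part of the original answer: the class name QuoteAwareSplitter and the activeQuote variable are made up for the example, and the sketch assumes that a quote character never appears inside an unquoted field and that the opening quote type is never used inside its own field. Like the code above, it keeps the quotes and surrounding whitespace in the returned fields.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplitter {

    /** Splits a line at commas that are not enclosed in single or double quotes. */
    public static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        char activeQuote = 0; // 0 means "currently outside quotes"
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (activeQuote != 0) {
                // Inside quotes: only the same quote character closes the field,
                // so a lone " inside '...' is treated as ordinary text.
                if (c == activeQuote) {
                    activeQuote = 0;
                }
            } else if (c == '"' || c == '\'') {
                activeQuote = c; // remember which quote type opened the field
            } else if (c == ',') {
                fields.add(line.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(line.substring(start)); // last field has no trailing comma
        return fields;
    }
}
```

Calling QuoteAwareSplitter.split on the fourth sample line from the question should yield three fields, with the unbalanced double quote left untouched inside the single-quoted field.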
Upvotes: 0
Reputation: 27373
If the use of single and double quotes is consistent per line, one could choose the corresponding type of quote per line:
public class CSVDemo {
    public static void main(String[] args) throws IOException {
        CSVDemo demo = new CSVDemo();
        demo.process("input.csv");
    }

    public void process(String fileName) throws IOException {
        String file = this.getClass().getClassLoader().getResource(fileName)
                .getFile();
        CSVParser doubleParser = new CSVParser(',', '"');
        CSVParser singleParser = new CSVParser(',', '\'');
        String[] nextLine;
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.contains(",'") && line.contains("',")) {
                    nextLine = singleParser.parseLine(line);
                } else {
                    nextLine = doubleParser.parseLine(line);
                }
                System.out.println(nextLine[0] + " | " + nextLine[1] + " | "
                        + nextLine[2]);
            }
        }
    }
}
Upvotes: 0
Reputation: 50203
Basically you only need to track ," and ,' (trimming what's in the middle).
When you encounter one of those, set the appropriate flag (eg. singleQuoteOpen, doubleQuoteOpen) to true to indicate they're open and you are in ignore-commas mode.
When you meet the appropriate closing quote, reset the flag and keep slicing the elements.
To perform the check, stop at every comma (when not in ignore-commas mode) and look at the next char (if any, and trimming).
Note: the regex solution is good and also shorter, but less customizable for edge cases (at least without big headaches).
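The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the class name FieldStartSplitter and the helper flags atFieldStart and quoted are invented for the example (only singleQuoteOpen and doubleQuoteOpen come from the description), a quote is honoured only when it is the first non-blank character of a field, unquoted fields are trimmed, the enclosing quotes are stripped, and any text between a closing quote and the next comma is ignored.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldStartSplitter {

    /** Splits a line at commas, honouring a quote only when it opens a field. */
    public static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        boolean singleQuoteOpen = false;
        boolean doubleQuoteOpen = false;
        boolean atFieldStart = true; // no non-blank character seen yet in this field
        boolean quoted = false;      // the current field was opened by a quote
        StringBuilder current = new StringBuilder();
        for (char c : line.toCharArray()) {
            if (singleQuoteOpen || doubleQuoteOpen) {
                // Ignore-commas mode: only the matching quote resets the flag.
                if ((singleQuoteOpen && c == '\'') || (doubleQuoteOpen && c == '"')) {
                    singleQuoteOpen = false;
                    doubleQuoteOpen = false;
                } else {
                    current.append(c);
                }
            } else if (c == ',') {
                fields.add(quoted ? current.toString() : current.toString().trim());
                current.setLength(0);
                quoted = false;
                atFieldStart = true;
            } else if (atFieldStart && (c == '\'' || c == '"')) {
                singleQuoteOpen = (c == '\'');
                doubleQuoteOpen = (c == '"');
                quoted = true;
                atFieldStart = false;
                current.setLength(0); // drop blanks seen before the opening quote
            } else if (!quoted) {
                if (!Character.isWhitespace(c)) {
                    atFieldStart = false;
                }
                current.append(c);
            }
            // Characters after a closing quote are skipped until the next comma.
        }
        fields.add(quoted ? current.toString() : current.toString().trim());
        return fields;
    }
}
```

Because quotes are only recognised at the start of a field, an apostrophe in the middle of an unquoted field (for example in it's) does not switch on ignore-commas mode.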
Upvotes: 1
Reputation: 65793
If you truly cannot use a real CSV parser, you could use a regex. This is generally not a good idea, as there are always edge cases that you cannot handle, but if the formatting is strictly as you describe then this may work.
// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
public void test() {
    String[] tests = {"numeStuff,\"hello\", someStuff, someStuff",
        "numeStuff,\"hello, SO\", someStuff, someStuff",
        "numeStuff,'say \"hello, world\"', someStuff, someStuff"
    };
    /* Matches a field and a potentially empty separator.
     *
     * (            - Field Group
     *  \"          - Start with a quote
     *  [^\"]*?     - Non-greedy match on anything that is not a quote
     *  \"          - End with a quote
     * |            - Or
     *  '           - Start with a strop
     *  [^']*?      - Non-greedy match on anything that is not a strop
     *  '           - End with a strop
     * |            - Or
     *  [^\"']      - Not starting with a quote or strop
     *  [^,\r\n]*?  - Non-greedy match on anything that is not a comma or line break
     * )            - End field group
     * (            - Separator group
     *  [,\r\n]|$   - Comma or line break separator, or end of input
     * )            - End separator group
     */
    Pattern p = Pattern.compile("(\"[^\"]*?\"|'[^\']*?\'|[^\"'][^,\r\n]*?)([,\r\n]|$)");
    for (String t : tests) {
        System.out.println("Matching: " + t);
        Matcher m = p.matcher(t);
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}
Upvotes: 2
Reputation:
It does not appear that opencsv supports this out of the box. You could extend com.opencsv.CSVParser
and implement your own algorithm for handling two types of quotes. This is the source of the method you would be changing and here is a stub to get you started.
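class MyCSVParser extends CSVParser {
    // A private method cannot be overridden, so this stub overrides the public
    // single-argument parseLine instead. Assumption: check the visibility of the
    // parseLine variants in the opencsv version you are using.
    @Override
    public String[] parseLine(String nextLine) throws IOException {
        // Your algorithm here
        throw new UnsupportedOperationException("not implemented yet");
    }
}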
Upvotes: 1