Reputation: 82
I need to write a regular expression for a string read from a file:
apple,boy,cat,"dog,cat","time\" after\"noon"
I need to split it into
apple boy cat dog,cat time"after"noon
I tried using
Pattern pattern = Pattern.compile("[\\\"]");
String[] items = pattern.split(match);
for the second part, but I could not get the right answer. Can you help me with this?
Upvotes: 3
Views: 261
Reputation: 350
Since your question is more of a parsing problem than a regex problem, here's another solution that will work:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class CsvReader {
    Reader r;
    int row, col;
    boolean endOfRow;

    public CsvReader(Reader r){
        // next() relies on mark()/reset(), which BufferedReader supports
        this.r = r instanceof BufferedReader ? r : new BufferedReader(r);
        this.row = -1;
        this.col = 0;
        this.endOfRow = true;
    }

    /**
     * Returns the next string in the input stream, or null when no input is left
     * @return the next field, trimmed, or null at end of input
     * @throws IOException if the underlying reader fails
     */
    public String next() throws IOException {
        int i = r.read();
        if(i == -1)
            return null;
        // Update the row/column counters for the field about to be read
        if(this.endOfRow){
            this.row++;
            this.col = 0;
            this.endOfRow = false;
        } else {
            this.col++;
        }
        StringBuilder b = new StringBuilder();
        outerLoop:
        while(true){
            char c = (char) i;
            if(i == -1)
                break;
            if(c == ','){
                // Unquoted comma: end of field
                break;
            } else if(c == '\n'){
                // Unquoted newline: end of field and end of row
                endOfRow = true;
                break;
            } else if(c == '\\'){
                // Backslash escape: take the next character literally
                i = r.read();
                if(i == -1){
                    break;
                } else {
                    b.append((char)i);
                }
            } else if(c == '"'){
                // Quoted section: commas and newlines are literal until the closing quote
                while(true){
                    i = r.read();
                    if(i == -1){
                        break outerLoop;
                    }
                    c = (char)i;
                    if(c == '\\'){
                        i = r.read();
                        if(i == -1){
                            break outerLoop;
                        } else {
                            b.append((char)i);
                        }
                    } else if(c == '"'){
                        // A doubled quote is an escaped quote; a single one closes the section
                        r.mark(2);
                        i = r.read();
                        if(i == '"'){
                            b.append('"');
                        } else {
                            r.reset();
                            break;
                        }
                    } else {
                        b.append(c);
                    }
                }
            } else {
                b.append(c);
            }
            i = r.read();
        }
        return b.toString().trim();
    }

    public int getColNum(){
        return col;
    }

    public int getRowNum(){
        return row;
    }

    public static void main(String[] args){
        try {
            String input = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"\nquick\"fix\" hello, \"\"\"who's there?\"";
            System.out.println(input);
            Reader r = new StringReader(input);
            CsvReader csv = new CsvReader(r);
            String s;
            while((s = csv.next()) != null){
                System.out.println("R" + csv.getRowNum() + "C" + csv.getColNum() + ": " + s);
            }
        } catch(IOException e){
            e.printStackTrace();
        }
    }
}
Running this code, I get the output:
R0C0: apple
R0C1: boy
R0C2: cat
R0C3: dog,cat
R0C4: time" after"noon
R1C0: quickfix hello
R1C1: "who's there?
This should fit your needs pretty well.
A few disclaimers, though:
Edit: Looked up the csv format, discovered there's no real standard, but updated my code to catch quotes escaped by doubling rather than backslashes.
Edit 2: Fixed. Should work as advertised now. Also modified it to test the tracking of row and column numbers.
Upvotes: 3
Reputation: 16354
I am not really sure about this, but you could have a go at Pattern.compile("[\\\\\"]");
\ is an escape character, and to detect a \ in the expression, \\\\ could be used.
A similar thing worked for me in another context and I hope it solves your problem too.
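To make the escaping layers concrete, here is a minimal sketch. The class name BackslashEscapes and the unescape helper are made up for illustration and are not from the question. It assumes the goal is a character class matching either a backslash or a double quote, and it shows one way the backslash escapes could be stripped from a field afterwards.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
final class BackslashEscapes {

    // The regex [\\"] is a character class holding one backslash and one quote.
    // In Java source each regex backslash is doubled again, and the quote is escaped.
    static final Pattern BACKSLASH_OR_QUOTE = Pattern.compile("[\\\\\"]");

    // Drops the backslash from every backslash-escaped character.
    // The regex is \\(.) : a literal backslash followed by any one character.
    static String unescape(String field) {
        return field.replaceAll("\\\\(.)", "$1");
    }

    public static void main(String[] args) {
        String field = "time\\\" after\\\"noon"; // the characters: time\" after\"noon
        System.out.println(unescape(field));
    }
}
```

Note that splitting on this character class would still cut the field at every backslash and every quote, so on its own it does not produce the fields asked for in the question.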
Upvotes: 0
Reputation: 350
First thing: String.split() uses the regex to find the separators, not the substrings.
Edit: I'm not sure if this can be done with String.split(). I think the only way you could deal with the quotes while only matching the comma would be with lookahead and lookbehind, and that's going to break in quite a lot of cases.
Edit2: I'm pretty sure it can be done with a regular expression. And I'm sure this one case could be solved with String.split(), but a general solution wouldn't be simple.
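To illustrate the point about String.split() matching separators rather than substrings, here is a small sketch. The class and method names are invented for this example. With split() the pattern describes what sits between the pieces, while with Matcher.find() the pattern describes the pieces themselves. Neither naive version knows about quotes, so both cut "dog,cat" in two.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
final class SplitVersusFind {

    // split(): the regex names the separator, the text between separators is returned.
    static List<String> bySplit(String input) {
        return Arrays.asList(input.split(","));
    }

    // find(): the regex names the pieces, here any run of non-comma characters.
    static List<String> byFind(String input) {
        List<String> pieces = new ArrayList<>();
        Matcher m = Pattern.compile("[^,]+").matcher(input);
        while (m.find()) {
            pieces.add(m.group());
        }
        return pieces;
    }

    public static void main(String[] args) {
        String input = "apple,boy,\"dog,cat\"";
        System.out.println(bySplit(input));
        System.out.println(byFind(input));
    }
}
```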
Basically, you're looking for anything that isn't a comma as input, [^,], and you can handle quotes as a separate case. I've gotten most of the way there myself. I'm getting this as output:
apple
boy
cat
dog
cat
time\" after\"noon
But I'm not sure why it has so many blank lines.
My complete code is:
String input = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"";
Pattern pattern = Pattern.compile("(\\s|[^,\"\\\\]|(\\\\.)||(\".*\"))*");
Matcher m = pattern.matcher(input);
while(m.find()){
    System.out.println(m.group());
}
I guess I'm almost there. It's spitting out ... oh hey, I see what's going on here. I think I can fix that.
But I'm going to echo the guy above and say that if there's no requirement to use a regular expression, it's probably simpler and better to do it one character at a time and implement the logic manually. If your regex isn't picture-perfect, it could cause all kinds of unpredictable weirdness down the line.
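For what it's worth, here is a rough sketch of how a Matcher-based version could look for a single line. It is not the pattern above but a different one, and it rests on a few assumptions: each field is either fully quoted or contains no quotes at all, a quote inside a quoted field is escaped with a backslash or by doubling it, and there are no line breaks inside fields. The class name RegexCsvLine and the method splitLine are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a sketch under the assumptions stated above.
final class RegexCsvLine {

    // \G anchors each match where the previous one ended, so nothing is skipped.
    // Group 1: body of a quoted field (plain characters, backslash escapes, doubled quotes).
    // Group 2: an unquoted field (no commas, no quotes).
    // Group 3: the delimiter, a comma, or empty at end of input.
    private static final Pattern FIELD = Pattern.compile(
            "\\G(?:\"((?:[^\"\\\\]|\\\\.|\"\")*)\"|([^,\"]*))(,|$)");

    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            String quoted = m.group(1);
            if (quoted != null) {
                // Drop the escaping character (backslash or first quote of a pair).
                fields.add(quoted.replaceAll("[\\\\\"](.)", "$1"));
            } else {
                fields.add(m.group(2));
            }
            if (m.group(3).isEmpty()) {
                return fields; // reached end of input
            }
        }
        // No match at the current position: for example an unterminated quote.
        throw new IllegalArgumentException("Malformed line: " + line);
    }

    public static void main(String[] args) {
        String input = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"";
        for (String field : splitLine(input)) {
            System.out.println(field);
        }
    }
}
```

Because every match must start exactly where the previous one stopped, a line the pattern cannot handle fails loudly instead of silently dropping characters, which is one way to limit the weirdness mentioned above.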
Upvotes: 0