user1272855

Reputation: 82

Regular expression for \" in Java

I need to write a regular expression for a string read from a file:

apple,boy,cat,"dog,cat","time\" after\"noon"

I need to split it into

apple
boy
cat
dog,cat
time"after"noon

I tried using

Pattern pattern = 
Pattern.compile("[\\\"]");
String items[]=pattern.split(match);

for the second part, but I could not get the right answer. Can you help me with this?

Upvotes: 3

Views: 261

Answers (3)

Captain Ford

Reputation: 350

Since your question is more of a parsing problem than a regex problem, here's another solution that will work:

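import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class CsvReader {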

    Reader r;
    int row, col;
    boolean endOfRow;

    public CsvReader(Reader r){
        this.r = r instanceof BufferedReader ? r : new BufferedReader(r);
        this.row = -1;
        this.col = 0;
        this.endOfRow = true;
    }

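    /**
     * Returns the next value in the input stream, or null when no input is left.
     * @return the next value with surrounding whitespace trimmed, or null at end of input
     * @throws IOException if reading from the underlying reader fails
     */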
    public String next() throws IOException {
        int i = r.read();
        if(i == -1)
            return null;

        if(this.endOfRow){
            this.row++;
            this.col = 0;
            this.endOfRow = false;
        } else {
            this.col++;
        }

        StringBuilder b = new StringBuilder();
outerLoop:  
        while(true){
            char c = (char) i;
            if(i == -1)
                break;
            if(c == ','){
                break;
            } else if(c == '\n'){
                endOfRow = true;
                break;
            } else if(c == '\\'){
                i = r.read();
                if(i == -1){
                    break;
                } else {
                    b.append((char)i);
                }
            } else if(c == '"'){
                while(true){
                    i = r.read();

                    if(i == -1){
                        break outerLoop;
                    }
                    c = (char)i;
                    if(c == '\\'){
                        i = r.read();
                        if(i == -1){
                            break outerLoop;
                        } else {
                            b.append((char)i);
                        }
                    } else if(c == '"'){
                        r.mark(2);
                        i = r.read();
                        if(i == '"'){
                            b.append('"');
                        } else {
                            r.reset();
                            break;
                        }
                    } else {
                        b.append(c);
                    }
                }
            } else {
                b.append(c);
            }
            i = r.read();
        }

        return b.toString().trim();
    }


    public int getColNum(){
        return col;
    }

    public int getRowNum(){
        return row;
    }

    public static void main(String[] args){

        try {
            String input = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"\nquick\"fix\" hello, \"\"\"who's there?\"";
            System.out.println(input);
            Reader r = new StringReader(input);
            CsvReader csv = new CsvReader(r);
            String s;
            while((s = csv.next()) != null){
                System.out.println("R" + csv.getRowNum() + "C" + csv.getColNum() + ": " + s);
            }
        } catch(IOException e){
            e.printStackTrace();
        }
    }
}

Running this code, I get the output:

R0C0: apple
R0C1: boy
R0C2: cat
R0C3: dog,cat
R0C4: time" after"noon
R1C0: quickfix hello
R1C1: "who's there?

This should fit your needs pretty well.

A few disclaimers, though:

  • It won't catch errors in the syntax of the CSV format, such as an unescaped quotation mark in the middle of a value.
  • It won't perform any character conversion (such as converting "\n" to a newline character). Backslashes simply cause the following character to be treated literally, including other backslashes. (That should be easy enough to alter if you need additional functionality.)
  • Some CSV files escape quotes by doubling them rather than using a backslash. This code now looks for both.
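
As a rough illustration of the second disclaimer, here is a minimal sketch of how escape conversion could be added. The helper names translateEscape and unescape are hypothetical and not part of the code above. The idea is that translateEscape would be applied to the character read after a backslash, in place of appending it unchanged.

```java
class EscapeDemo {

    // Hypothetical helper: maps the character that follows a backslash
    // to the character it stands for. Unknown escapes are kept literally,
    // which is what the reader above already does.
    static char translateEscape(char c) {
        switch (c) {
            case 'n': return '\n';
            case 't': return '\t';
            case 'r': return '\r';
            default:  return c;   // escaped quotes and backslashes stay as they are
        }
    }

    // Applies translateEscape to every backslash sequence in a string.
    // A lone backslash at the very end is copied through unchanged.
    static String unescape(String s) {
        StringBuilder b = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                b.append(translateEscape(s.charAt(++i)));
            } else {
                b.append(c);
            }
        }
        return b.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("time\\\" after\\\"noon"));
    }
}
```

Which escapes to support (only n, t and r here) is an assumption. Since there is no single CSV dialect, the list would need to match whatever produced the file.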

Edit: I looked up the CSV format and discovered there's no real standard, so I updated my code to also catch quotes escaped by doubling rather than with backslashes.

Edit 2: Fixed. Should work as advertised now. Also modified it to test the tracking of row and column numbers.

Upvotes: 3

Swayam

Reputation: 16354

I am not really sure about this, but you could have a go at Pattern.compile("[\\\\\"]");

\ is an escape character both in Java string literals and in regular expressions, so to match a single literal \ in the expression, \\\\ has to be written in the source code.

A similar thing worked for me in another context and I hope it solves your problem too.
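
To make the two layers of escaping concrete, here is a small sketch. The class name BackslashDemo and the helper pieces are made up for this illustration. It counts how many pieces split() produces for a few patterns.

```java
import java.util.regex.Pattern;

class BackslashDemo {

    // Counts how many pieces split() produces for a given regex and input.
    static int pieces(String regex, String input) {
        return Pattern.compile(regex).split(input).length;
    }

    public static void main(String[] args) {
        // The Java literal "x\\y\"z" is the five-character string x\y"z.
        String input = "x\\y\"z";

        // The literal "\\\\" is the regex \\ and matches one backslash.
        System.out.println(pieces("\\\\", input));      // 2

        // The literal "[\\\"]" is the regex [\"] and matches only a quote.
        System.out.println(pieces("[\\\"]", input));    // 2

        // The literal "[\\\\\"]" is the regex [\\"] and matches a
        // backslash or a quote.
        System.out.println(pieces("[\\\\\"]", input));  // 3
    }
}
```

Note that the pattern from the question, "[\\\"]", only ever matches the double quote, not the backslash in front of it.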

Upvotes: 0

Captain Ford

Reputation: 350

First thing: String.split() uses the regex to find the separators, not the substrings.

Edit: I'm not sure if this can be done with String.split(). I think the only way you could deal with the quotes while only matching the comma would be with lookahead and lookbehind, and that's going to break in quite a lot of cases.

Edit2: I'm pretty sure it can be done with a regular expression. And I'm sure this one case could be solved with String.split() -- but a general solution wouldn't be simple.
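
For illustration, here is a sketch of the lookahead idea. It uses a commonly seen pattern, not something from the question: split on a comma only if an even number of double quotes follows it. The class name and the SEPARATOR constant are made up for the example. It handles plain quoting but leaves the quotes attached, and a backslash-escaped quote can throw off the count.

```java
import java.util.Arrays;
import java.util.List;

class LookaheadSplitDemo {

    // Matches a comma only if an even number of double quotes follows it,
    // i.e. the comma is assumed to sit outside any quoted value.
    static final String SEPARATOR = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static List<String> split(String line) {
        return Arrays.asList(line.split(SEPARATOR));
    }

    public static void main(String[] args) {
        // Plain quoting works, but the quotes stay attached to the value.
        System.out.println(split("apple,boy,cat,\"dog,cat\""));

        // The input x,"a\",b" contains three quote characters, so no comma
        // has an even number of quotes after it and nothing is split.
        System.out.println(split("x,\"a\\\",b\""));
    }
}
```

The second call returns the whole line as a single piece, which is the kind of breakage mentioned above.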

Basically, you're looking for anything that isn't a comma as input, [^,], and you can handle quotes as a separate case. I've gotten most of the way there myself. I'm getting this as output:

apple

boy

cat


dog

cat



time\" after\"noon

But I'm not sure why it has so many blank lines.

My complete code is:

String input = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"";

Pattern pattern =
        Pattern.compile("(\\s|[^,\"\\\\]|(\\\\.)||(\".*\"))*");
Matcher m = pattern.matcher(input);

while(m.find()){
    System.out.println(m.group());
}

But then I guess I'm almost there. It's spitting out ... oh hey, I see what's going on here. I think I can fix that.

Still, I'm going to echo the guy above and say that if there's no requirement to use a regular expression, it's probably better to do it one character at a time and implement the logic manually. If your regex isn't picture-perfect, then it could cause all kinds of unpredictable weirdness down the line.
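
One likely reason for the blank lines is the empty alternative between the two | characters in the pattern above. It lets the whole expression match the empty string at every comma and quote. As a sketch only, a pattern without that alternative might look like the one below. The class name, the FIELD constant and the fields method are made up for this example. It assumes values are either fully quoted or contain no quotes at all, and it skips empty unquoted values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFieldsDemo {

    // Either a quoted value (group 1), in which a backslash escapes the
    // next character, or a run of characters that are neither commas
    // nor quotes (group 2).
    static final Pattern FIELD =
            Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"|([^,\"]+)");

    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                // Quoted value: drop the backslash from each escape sequence.
                result.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                result.add(m.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"";
        for (String field : fields(input)) {
            System.out.println(field);
        }
    }
}
```

For the sample line this yields apple, boy, cat, dog,cat and time" after"noon. It does no error checking, so malformed input is silently mangled rather than rejected.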

Upvotes: 0
