boardkeystown

Reputation: 184

Regex to split string and preserve content within double quotes

I know this has been asked a lot, but I haven't found a solution for exactly what I'm trying to do, so please allow me to explain my problem.

I need a way to tokenize a string on ',', '.', and whitespace, while keeping anything between double quotes as a single token, with none of the other regex rules applied inside the quotes.

In these examples, '[]' represents a single space.

Suppose I have a string like this:

ADD[]r2,[]r3

Now with a regex like this:

((?<=\s)|(?=\s+))|((?<=,))|(?=\.)

I can split the string like so:

1: ADD
2: []
3: r2,
4: []
5: r3

This is what I want.

Now suppose I have a string like this:

"ADD[]r2,[]r3"[]"foo[]bar"

Now with a regex like this:

(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)

I can split the string like this:

1: "ADD[]r2,[]r3"
2: []
3: "foo[]bar"

But if I had a string like this:

ADD[]r2,[]r3[]"ADD[]r2,[]r3"

And used a regex like this:

(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)((?<=\s)|(?=\s+))|((?<=,))|(?=\.)

I would end up with something like this:

1: ADD
2: []
3: r2,
4: []
5: r3
6: []
7: "ADD[]r2,
8: []r3"

But what I want is this:

1: ADD
2: []
3: r2,
4: []
5: r3
6: []
7: "ADD[]r2,[]r3"

Is it possible to do this with a regex? Or do I need to do something more complex? What I'm trying to do is basically make a regex to split up code syntax. I just need a way to split up a line like I have described.

Any help or suggestions would be greatly appreciated.

EDIT: Example driver code for what I'm trying to do

    String line = "ADD r2, r3 \"ADD r2, r3\"";
    // Split on the combined regex and print each token on its own line.
    String[] arrLine = line.split("(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)((?<=\\s)|(?=\\s+))|((?<=,))|(?=\\.)");

    for (String token : arrLine) {
        System.out.println(token);
    }
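A likely reason the combined pattern still splits inside the quotes: alternation binds more loosely than concatenation, so the quote lookahead is attached only to the whitespace alternatives, while `(?<=,)` and `(?=\.)` stay unguarded. The sketch below wraps all the boundaries in one non-capturing group so the lookahead applies to each of them. It assumes quotes are always balanced and never escaped; the names `GroupedSplit`, `OUTSIDE_QUOTES`, `BOUNDARY`, and `tokenize` are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

class GroupedSplit {
    // Succeeds only when an even number of double quotes remains ahead,
    // i.e. the current position is outside a quoted section.
    // Assumption: quotes are balanced and never escaped.
    static final String OUTSIDE_QUOTES = "(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // The original boundaries, grouped so the lookahead above guards
    // every alternative instead of only the first one.
    static final String BOUNDARY = "(?:(?<=\\s)|(?=\\s)|(?<=,)|(?=\\.))";

    static List<String> tokenize(String line) {
        return Arrays.asList(line.split(OUTSIDE_QUOTES + BOUNDARY));
    }

    public static void main(String[] args) {
        // Brackets make the whitespace-only tokens visible.
        for (String token : tokenize("ADD r2, r3 \"ADD r2, r3\"")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

With the sample line this should yield the seven tokens listed as the desired result. Note that the lookahead rescans the rest of the line at every candidate position, which is fine for short lines of code but scales poorly for long input.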

Upvotes: 1

Views: 798

Answers (2)

Jonathan Locke

Reputation: 303

When I see a problem like this, my first thought is generally to break it down into two or more simpler problems. It also occurs to me that your problem may get more complicated over time, so it might be worth considering a parser generator such as ANTLR here.
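One way to read "two simpler problems" is: first cut the line into quoted and unquoted pieces, then apply the simple boundary regex from the question only to the unquoted pieces. The following is a minimal sketch of that idea, not a definitive implementation. It assumes quotes are never escaped and treats an unterminated quote as ordinary text; `TwoStepTokenizer`, `tokenize`, and `addPlain` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class TwoStepTokenizer {
    // Boundary pattern for text outside quotes (same rules as in the question).
    static final String BOUNDARY = "(?<=\\s)|(?=\\s)|(?<=,)|(?=\\.)";

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            int open = line.indexOf('"', i);
            int close = (open < 0) ? -1 : line.indexOf('"', open + 1);
            if (close < 0) {
                // No complete quoted section left: treat the rest as plain text.
                addPlain(line.substring(i), tokens);
                break;
            }
            addPlain(line.substring(i, open), tokens);    // step 2 on the unquoted part
            tokens.add(line.substring(open, close + 1));  // quoted part kept verbatim
            i = close + 1;
        }
        return tokens;
    }

    // Splits unquoted text on the boundary regex and appends the pieces.
    static void addPlain(String text, List<String> tokens) {
        if (text.isEmpty()) {
            return;
        }
        for (String piece : text.split(BOUNDARY)) {
            tokens.add(piece);
        }
    }
}
```

Keeping the quote handling out of the regex means each half can be changed on its own, for example to add escape sequences later, which is harder to do inside a single lookahead.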

Upvotes: 1

The fourth bird

Reputation: 163287

Instead of using split, you can match one of three things: a section from an opening double quote to the closing one, a run of whitespace characters, or a run of characters that are neither whitespace nor double quotes.

In Java, \h matches a horizontal whitespace character. Use \s instead if the whitespace tokens should also be able to include newlines.
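The difference between the two classes can be checked directly. This is a small sketch, assuming Java 8 or later (where `\h` is available); `WhitespaceClasses` and its two helper methods are hypothetical names used only for this demonstration.

```java
import java.util.regex.Pattern;

class WhitespaceClasses {
    // \h: horizontal whitespace only (space, tab, and similar), no line breaks.
    static boolean isHorizontal(String s) {
        return Pattern.matches("\\h+", s);
    }

    // \s: space and tab, and also line terminators such as \n and \r.
    static boolean isAnyWhitespace(String s) {
        return Pattern.matches("\\s+", s);
    }

    public static void main(String[] args) {
        System.out.println(isHorizontal(" \t"));     // true
        System.out.println(isHorizontal(" \n"));     // false: \n is not horizontal
        System.out.println(isAnyWhitespace(" \n"));  // true
    }
}
```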

"[^"]*"|\h+|[^\h"]+


In Java

String regex = "\"[^\"]*\"|\\h+|[^\\h\"]+";
String string = "ADD r2, r3 \"ADD r2, r3\"";

Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);

while (matcher.find()) {
    System.out.println(matcher.group(0));
}

Output

ADD
 
r2,
 
r3
 
"ADD r2, r3"

Upvotes: 1
