Dark Knight

Reputation: 307

Java Regex: Splitting based on multiple conditions with exceptions

I would like to create a regex to split a string in Java under the following constraints:

Split on any non-word character, except for:
 (a) Characters surrounded by ' '
 (b) Any instance of    :=   >=   <=   <>   ..

So that for the following sample string:

print('*');  x := x - 100

I can get the following result in a String[]:

print
(
'*'
)
;

x

:=

x

-

100

This is the regex I have so far:

str.split("\\s+|"+
          "(?=[^\\w'][^']*('[^']*'[^']*)*$)|" +
          "(?<=[^\\w'])(?=[^']*('[^']*'[^']*)*$)|" +
          "(?=('[^']*'[^']*)*$)|" +
          "(?<=')(?=[^']*('[^']*'[^']*)*$)");

But this gives me the following result:

print
(
'*'
)
;

x

:    
=    <-- This is the problem: it should be on the line above, next to the :

x

-

100

UPDATE

I have now learned that it's not possible to achieve this using regex.

However, I still cannot use any external libraries, frameworks, or lexers, and have to use built-in Java classes, such as StringTokenizer.

Upvotes: 0

Views: 704

Answers (1)

Andreas

Reputation: 159135

Disclaimer: Regex is not a generic parser. If the text you're reading is a complex language, with nested constructs, then you need to use an actual lexer, not a regex. E.g. the code below supports "Characters surrounded by ' '", which is a simple definition, but if the characters can contain escaped ' characters, you'll need a lexer.

Don't use split().

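Your code will be much easier to read and understand if you use a find() loop. It'll also perform better, because it avoids the lookaheads in your split() regex that rescan to the end of the string at every position.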

You write your regex to specify what you want to capture in one iteration of the find() loop. You can rely on | to choose the first pattern that matches, so put more specific patterns first.

Pattern p = Pattern.compile("\\s+" +    // sequence of whitespace
                           "|\\w+" +    // sequence of word characters
                           "|'[^']*'" + // Characters surrounded by ' '
                           "|[:><]=" +  // :=   >=   <=
                           "|<>" +      // <>
                           "|\\.\\." +  // ..
                           "|.");       // Any single other character
String input = "print('*');  x := x - 100";
for (Matcher m = p.matcher(input); m.find(); )
    System.out.println(m.group());

Output

print
(
'*'
)
;

x

:=

x

-

100
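
Since the question asks for the result in a String[], here is a minimal sketch that wraps the same find() loop in a method and collects the matches into an array. The class name TokenizerSketch and the helper tokenize are hypothetical names chosen for illustration; the pattern is the one shown above, unchanged. Whitespace runs are kept as tokens, matching the output listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizerSketch {
    // Same alternation as above: multi-character operators are listed
    // before the catch-all "." so that ":=" is matched as one token.
    private static final Pattern TOKEN = Pattern.compile(
            "\\s+"          // sequence of whitespace
            + "|\\w+"       // sequence of word characters
            + "|'[^']*'"    // characters surrounded by ' '
            + "|[:><]="     // :=   >=   <=
            + "|<>"         // <>
            + "|\\.\\."     // ..
            + "|.");        // any single other character

    // Hypothetical helper: collects every match, in order, into a String[].
    static String[] tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Brackets make the whitespace tokens visible in the printout.
        for (String token : tokenize("print('*');  x := x - 100")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

If the whitespace tokens are not wanted, one option is to skip any match whose first character is whitespace before adding it to the list.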

Upvotes: 1
