Reputation: 307
I would like to create a regex so that I can split a string in Java with the following constraints:
Any non-word character, except for:
(a) Characters surrounded by ' '
(b) Any instance of := >= <= <> ..
So that for the following sample string:
print('*'); x := x - 100
I can get the following result in a String[]:
print
(
'*'
)
;
x
:=
x
-
100
This is the regex I have so far:
str.split("\\s+|"+
"(?=[^\\w'][^']*('[^']*'[^']*)*$)|" +
"(?<=[^\\w'])(?=[^']*('[^']*'[^']*)*$)|" +
"(?=('[^']*'[^']*)*$)|" +
"(?<=')(?=[^']*('[^']*'[^']*)*$)");
But this gives me the following result:
print
(
'*'
)
;
x
:
= <!-- This is the problem. Should be above next to the :
x
-
100
UPDATE
I have now learned that it's not possible to achieve this using regex.
However, I still cannot use any external libraries, frameworks, or lexers, and have to use built-in Java classes, such as StringTokenizer.
Upvotes: 0
Views: 704
Reputation: 159135
Disclaimer: Regex is not a generic parser. If the text you're reading is a complex language, with nested constructs, then you need to use an actual lexer, not a regex. E.g. the code below supports "Characters surrounded by ' '", which is a simple definition, but if the characters can contain escaped ' characters, you'll need a lexer.
Don't use split().
Your code will be much easier to read and understand if you use a find() loop. It'll also perform better.
You write your regex to specify what you want to capture in one iteration of the find() loop. You can rely on | to choose the first pattern that matches, so put more specific patterns first.
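To see why the order of alternatives matters, here is a minimal sketch. The class name AlternationOrder and the helper matches() are made up for illustration; only Pattern and Matcher come from the JDK. It runs the same two alternatives in both orders against the same input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrder {
    // Hypothetical helper: returns every match of regex in input, in order.
    static List<String> matches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Specific alternative first: ":=" is matched as one token.
        System.out.println(matches(":=|.", "x:=1"));   // [x, :=, 1]
        // Catch-all first: "." always wins, so ":" and "=" are separate.
        System.out.println(matches(".|:=", "x:=1"));   // [x, :, =, 1]
    }
}
```

The regex engine tries the alternatives left to right at each position and takes the first one that succeeds, which is why the single-character catch-all belongs at the end.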
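import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("\\s+" +      // sequence of whitespace (matched, then skipped below)
                            "|\\w+" +     // sequence of word characters
                            "|'[^']*'" +  // Characters surrounded by ' '
                            "|[:><]=" +   // := >= <=
                            "|<>" +       // <>
                            "|\\.\\." +   // ..
                            "|.");        // Any single other character
String input = "print('*'); x := x - 100";
for (Matcher m = p.matcher(input); m.find(); ) {
    String token = m.group();
    if (!token.trim().isEmpty())          // don't print whitespace-only matches
        System.out.println(token);
}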
Output
print
(
'*'
)
;
x
:=
x
-
100
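Since the question asks for the result as a String[], the same find() loop can collect the tokens instead of printing them. This is only a sketch: the class name Tokenizer and the method tokenize() are invented here, and it is a variation on the pattern above rather than the exact same one. It drops the "\\s+" alternative and uses "\\S" as the catch-all, so whitespace is never matched and needs no filtering.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tokenizer {
    // Variation of the pattern above: no "\\s+" alternative, and "\\S"
    // as the catch-all, so find() simply steps over whitespace.
    private static final Pattern TOKEN = Pattern.compile(
            "\\w+"          // sequence of word characters
            + "|'[^']*'"    // characters surrounded by ' '
            + "|[:><]="     // := >= <=
            + "|<>"         // <>
            + "|\\.\\."     // ..
            + "|\\S");      // any other single non-whitespace character

    // Hypothetical helper: returns the tokens of input in order.
    static String[] tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] result = tokenize("print('*'); x := x - 100");
        // prints: print | ( | '*' | ) | ; | x | := | x | - | 100
        System.out.println(String.join(" | ", result));
    }
}
```

The disclaimer about escaped ' characters applies to this variant as well, because it uses the same '[^']*' alternative.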
Upvotes: 1