Reputation: 1822
I'm new to regular expressions, and I'm trying to use one to parse tokens separated by "(", ")", and whitespace. This is my attempt:
String str = "(test (_bit1 _bit2 |bit3::&92;test#4|))";
String[] tokens = str.split("[\\s*[()]]");
for (int i = 0; i < tokens.length; i++)
    System.out.println(i + " : " + tokens[i]);
I expect the following output:
0 : test
1 : _bit1
2 : _bit2
3 : |bit3::&92;test#4|
However, two empty tokens appear in the actual output:
0 :
1 : test
2 :
3 : _bit1
4 : _bit2
5 : |bit3::&92;test#4|
I don't understand why I get two empty tokens at positions 0 and 2. Could anyone give me a hint? Thank you.
===== Update ====
Alan Moore posted an answer and later deleted it. I like the answer, so I am copying it here for my own reference.
Your regex, [\s*[()]], matches one whitespace character (\s) or one of the characters *, (, or ). The delimiter at the beginning of the string, (, is why you get the empty first token. There's no way around that; you just have to check for an empty first token and ignore it.
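To make the point about the character class concrete, here is a small sketch (the class and method names are made up for illustration) showing that the original pattern consumes exactly one character per match, and that the * inside the brackets is just a literal asterisk:

```java
import java.util.Arrays;

public class CharClassDemo {
    // The pattern from the question: a character class, so each match is ONE character.
    static final String ORIGINAL = "[\\s*[()]]";

    static String[] split(String input) {
        return input.split(ORIGINAL);
    }

    public static void main(String[] args) {
        // '*' is a literal member of the class, so it acts as a delimiter.
        System.out.println(Arrays.toString(split("a*b")));   // prints [a, b]
        // A space followed by '(' counts as two delimiters with an empty token between them.
        System.out.println(Arrays.toString(split("a (b")));  // prints [a, , b]
    }
}
```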
The second empty token is between the first space and the ( that follows it. That one's on you, because you used * (zero or more) instead of + (one or more). But fixing it isn't that simple. You want to split on spaces, parens, or both, but you have to make sure there's at least one character, whichever it is. This might do it:
\s*[()]+\s*|\s+
But you probably should allow for spaces between parens, too:
\s*(?:[()]+\s*)+|\s+
As a Java string literal, that would be:
"\\s*(?:[()]+\\s*)+|\\s+"
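For reference, this is roughly how that regex could be used, including the check for the empty first token mentioned above. It is only a sketch: the class name, the method name, and the use of Arrays.copyOfRange to drop the first element are my own choices, not part of the answer.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitOnDelimiterRuns {
    // The second regex from the answer, written as a Java string literal.
    private static final Pattern DELIMS = Pattern.compile("\\s*(?:[()]+\\s*)+|\\s+");

    static String[] tokenize(String input) {
        String[] parts = DELIMS.split(input);
        // A delimiter at the very start still yields one empty leading token; drop it.
        if (parts.length > 0 && parts[0].isEmpty()) {
            return Arrays.copyOfRange(parts, 1, parts.length);
        }
        return parts;
    }

    public static void main(String[] args) {
        String str = "(test (_bit1 _bit2 |bit3::&92;test#4|))";
        String[] tokens = tokenize(str);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(i + " : " + tokens[i]);
        }
    }
}
```

With the input from the question this should print the four expected tokens, numbered 0 to 3.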
Upvotes: 1
Views: 1412
Reputation: 11911
Index 0 is the token before the first (. Index 2 is the token between the space and the second ( in your input string.
I don't think you can avoid the first one, but you can suppress the second by using
str.split("[\\s()]+");
Upvotes: 0
Reputation: 10342
Your regexp is wrong; try this:
String[] tokens = str.split("[\\s()]+"); // matches a run of at least one delimiter character
UPDATE: I've noticed your code actually removes the parentheses, so it seems you don't have to escape them inside brackets. I'm not sure why; can anyone explain?
NEW UPDATE: Thanks @AlanMoore for the explanation. As I understand it, parentheses within [] do not need to be escaped.
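A quick way to convince yourself is to compare both spellings. The sketch below (class and method names are illustrative) splits the same input with escaped and unescaped parentheses inside the character class. Inside [...] the parentheses have no grouping meaning, so the backslashes are optional; note that in Java source the escaped form needs doubled backslashes.

```java
import java.util.Arrays;

public class ParenEscapeDemo {
    // Parentheses written without escapes inside the character class.
    static String[] splitUnescaped(String s) {
        return s.split("[\\s()]+");
    }

    // The same class with the parentheses escaped (regex: [\s\(\)]+).
    static String[] splitEscaped(String s) {
        return s.split("[\\s\\(\\)]+");
    }

    public static void main(String[] args) {
        String str = "(test (_bit1 _bit2 |bit3::&92;test#4|))";
        // The leading "(" still produces one empty token at index 0.
        System.out.println(Arrays.toString(splitUnescaped(str)));
        // Both patterns produce the same array.
        System.out.println(Arrays.equals(splitUnescaped(str), splitEscaped(str))); // prints true
    }
}
```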
Upvotes: 3
Reputation: 444
The problem you are running into is that split treats every single delimiter character as a separator, so when two delimiters are adjacent it returns the empty string between them as a token.
You can see what I'm talking about by adding an extra ( like this:
String str = "(test (_bit1 (_bit2 |bit3::&92;test#4|))";
The output will then become:
0 :
1 : test
2 :
3 : _bit1
4 :
5 : _bit2
6 : |bit3::&92;test#4|
I would recommend the following code:
// Requires: import java.util.ArrayList;
String str = "(test (_bit1 (_bit2 |bit3::&92;test#4|))";
String[] tokensArray = str.split("[\\s[()]*]");
ArrayList<String> tokens = new ArrayList<>();
// Keep only the non-empty tokens.
for (String token : tokensArray) {
    if (!token.isEmpty()) {
        tokens.add(token);
    }
}
for (int i = 0; i < tokens.size(); i++)
    System.out.println(i + " : " + tokens.get(i));
What this does is remove any empty tokens from the array, since those are considered "improper" tokens.
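If you are on Java 8 or later, the same filtering could probably be written with a stream instead of the explicit loop. This is just an alternative sketch of the idea above, not a different technique; the class and method names are made up, and the pattern is the one from this answer (which, as written, also treats a literal * as a delimiter).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilterEmptyTokens {
    static List<String> tokenize(String input) {
        // Split on single delimiter characters, then discard the empty strings.
        return Arrays.stream(input.split("[\\s[()]*]"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String str = "(test (_bit1 (_bit2 |bit3::&92;test#4|))";
        List<String> tokens = tokenize(str);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(i + " : " + tokens.get(i));
        }
    }
}
```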
Upvotes: 1
Reputation: 39355
My suggestion is to first remove the splitting characters from both ends of the string (to avoid empty strings), and then do the split.
String[] tokens = str.replaceAll("^[\\s()]+|[\\s()]+$", "").split("[\\s()]+");
The replaceAll call strips the leading or trailing delimiters.
Also, I have placed your splitting characters (whitespace, (, and )) inside a character class [].
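One thing to watch for with this approach, as far as I can tell: if the input consists only of delimiters, replaceAll leaves an empty string, and splitting an empty string returns an array containing one empty string rather than an empty array. A minimal sketch that guards against that (the class name, method name, and the early return are my additions, not part of the one-liner above):

```java
import java.util.Arrays;

public class TrimThenSplit {
    static String[] tokenize(String input) {
        // Strip delimiter runs from both ends first.
        String trimmed = input.replaceAll("^[\\s()]+|[\\s()]+$", "");
        // "".split(...) returns {""}, so handle the nothing-left case explicitly.
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("[\\s()]+");
    }

    public static void main(String[] args) {
        String str = "(test (_bit1 _bit2 |bit3::&92;test#4|))";
        System.out.println(Arrays.toString(tokenize(str)));
        System.out.println(tokenize("( )").length); // prints 0
    }
}
```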
Upvotes: 2