Reputation: 2062
I have a String
a-b-c
I want to tokenize the string on the character '-', so the result would be
[a, b, c]
But then I have a String
a---c
The result should be
[a, -, c]
Is there already a tokenizer in Java which can do this?
Upvotes: 0
Views: 393
Reputation: 9201
This is a solution using only regexps to give you the needed result for your test data:
\b-|-\b
The possibilities of the word boundary (\b) are often underestimated, but it can simplify many regexps dramatically.
With the provided regexp you can now use Java's split method. A small test class could look like this:
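import java.util.Arrays;

public class SimpleRegExp {
    public static void main(String[] args) {
        // A hyphen counts as a delimiter only if a word boundary sits directly before or after it.
        String regexp = "\\b-|-\\b";
        System.out.println(Arrays.toString("a-b-c".split(regexp)));
        System.out.println(Arrays.toString("a---c".split(regexp)));
    }
}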
and prints this result:
[a, b, c]
[a, -, c]
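To see why the middle hyphen of a---c survives, it may help to list which hyphens the pattern actually matches. The following is only a minimal sketch; the class name BoundaryDemo and the helper delimiterPositions are made up for illustration and are not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Same pattern as in the answer: a hyphen with a word boundary on either side.
    private static final Pattern DELIMITER = Pattern.compile("\\b-|-\\b");

    // Hypothetical helper: returns the index of every hyphen the pattern treats as a delimiter.
    static List<Integer> delimiterPositions(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = DELIMITER.matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(delimiterPositions("a-b-c")); // [1, 3]: both hyphens touch a letter
        System.out.println(delimiterPositions("a---c")); // [1, 3]: the hyphen at index 2 is not matched
    }
}
```

The hyphen at index 2 has a hyphen on both sides, so there is no word boundary next to it and split leaves it in place as a token.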
Upvotes: 1
Reputation: 69399
I'm going to assume your delimiter is always one hyphen, and that --- would split to [-,-], and that ---- would either be invalid or also split to [-,-]. In that case, the following would work for you:
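// Requires java.util.ArrayList and java.util.List.
private static List<String> tokenize(String input, char delimiter) {
    List<String> result = new ArrayList<String>();
    StringBuilder builder = new StringBuilder();
    for (char c : input.toCharArray()) {
        if (builder.length() == 0) {
            // The first character of a token is always kept, even if it is the delimiter.
            builder.append(c);
        } else if (c == delimiter) {
            // A delimiter after at least one character ends the current token.
            result.add(builder.toString());
            builder.setLength(0);
        } else {
            builder.append(c);
        }
    }
    // Flush the last token, if any.
    if (builder.length() > 0) {
        result.add(builder.toString());
    }
    return result;
}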
Test code:
public static void main(String[] args) throws Exception {
    String s1 = "a-b-c";
    String s2 = "a---c";
    System.out.println(Arrays.toString(tokenize(s1, '-').toArray()));
    System.out.println(Arrays.toString(tokenize(s2, '-').toArray()));
}
Prints:
[a, b, c]
[a, -, c]
Upvotes: 0
Reputation: 31300
This (first try) appears to handle your samples as requested.
String rex = "(?<=-)-(?=\\w)|(?<=\\w)-(?=-)|(?<=\\w)-(?=\\w)";
String[] t1 = s1.split( rex );
Is \w the right assumption for the characters that are not '-'? If not, it should be changed.
Also, I think it can be condensed somewhat.
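One possible condensed form, offered only as a sketch: split on a hyphen that directly follows or directly precedes a word character. The pattern, the class name CondensedRegex and the helper tokenize below are my own illustration, not something taken from the answer. It gives the same result for the two samples, but it is not strictly equivalent to the longer pattern: for input that starts or ends with a hyphen (for example "a-") the condensed version also splits on that outer hyphen, while the longer one does not.

```java
import java.util.Arrays;

public class CondensedRegex {
    // Hypothetical shorter pattern: a hyphen right after, or right before, a word character.
    static final String CONDENSED = "(?<=\\w)-|-(?=\\w)";

    // Illustrative helper that splits the input with the condensed pattern.
    static String[] tokenize(String input) {
        return input.split(CONDENSED);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("a-b-c"))); // [a, b, c]
        System.out.println(Arrays.toString(tokenize("a---c"))); // [a, -, c]
    }
}
```

As with the original pattern, this relies on \w describing the non-hyphen tokens; if the tokens can contain other characters, the lookarounds would need a different character class.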
Upvotes: 0