Reputation: 1292
I try to split a String into tokens.
The token delimiters are not single characters, some delimiters are included into others (example, & and &&), and I need to have the delimiters returned as token.
StringTokenizer is not able to deal with multiple characters delimiters. I presume it's possible with String.split, but fail to guess the magical regular expression that will suits my needs.
Any idea ?
Example:
Token delimiters: "&", "&&", "=", "=>", " "
String to tokenize: a & b&&c=>d
Expected result: an string array containing "a", " ", "&", " ", "b", "&&", "c", "=>", "d"
--- Edit ---
Thanks to all for your help, Dasblinkenlight gives me the solution. Here is the "ready to use" code I wrote with his help:
private static String[] wonderfulTokenizer(String string, String[] delimiters) {
// First, create a regular expression that matches the union of the delimiters
// Be aware that, in case of delimiters containing others (example && and &),
// the longer may be before the shorter (&& should be before &) or the regexpr
// parser will recognize && as two &.
Arrays.sort(delimiters, new Comparator<String>() {
@Override
public int compare(String o1, String o2) {
return -o1.compareTo(o2);
}
});
// Build a string that will contain the regular expression
StringBuilder regexpr = new StringBuilder();
regexpr.append('(');
for (String delim : delimiters) { // For each delimiter
if (regexpr.length() != 1) regexpr.append('|'); // Add union separator if needed
for (int i = 0; i < delim.length(); i++) {
// Add an escape character if the character is a regexp reserved char
regexpr.append('\\');
regexpr.append(delim.charAt(i));
}
}
regexpr.append(')'); // Close the union
Pattern p = Pattern.compile(regexpr.toString());
// Now, search for the tokens
List<String> res = new ArrayList<String>();
Matcher m = p.matcher(string);
int pos = 0;
while (m.find()) { // While there's a delimiter in the string
if (pos != m.start()) {
// If there's something between the current and the previous delimiter
// Add it to the tokens list
res.add(string.substring(pos, m.start()));
}
res.add(m.group()); // add the delimiter
pos = m.end(); // Remember end of delimiter
}
if (pos != string.length()) {
// If it remains some characters in the string after last delimiter
// Add this to the token list
res.add(string.substring(pos));
}
// Return the result
return res.toArray(new String[res.size()]);
}
It could be optimize if you have many strings to tokenize by creating the Pattern only one time.
Upvotes: 6
Views: 11667
Reputation: 726499
You can use the Pattern
and a simple loop to achieve the results that you are looking for:
List<String> res = new ArrayList<String>();
Pattern p = Pattern.compile("([&]{1,2}|=>?| +)");
String s = "s=a&=>b";
Matcher m = p.matcher(s);
int pos = 0;
while (m.find()) {
if (pos != m.start()) {
res.add(s.substring(pos, m.start()));
}
res.add(m.group());
pos = m.end();
}
if (pos != s.length()) {
res.add(s.substring(pos));
}
for (String t : res) {
System.out.println("'"+t+"'");
}
This produces the result below:
's'
'='
'a'
'&'
'=>'
'b'
Upvotes: 4
Reputation: 3988
Try this:
String test = "a & b&&c=>d=A";
String regEx = "(&[&]?|=[>]?)";
String[] res = test.split(regEx);
for(String s : res){
System.out.println("Token: "+s);
}
I added the '=A' at the end to show that that is also parsed.
As mentioned in another answer, if you need the atypical behaviour of keeping the delimiters in the result, you will probably need to create you parser yourself....but in that case you really have to think about what a "delimiter" is in your code.
Upvotes: 1
Reputation: 361
Split won't do it for you as it removed the delimeter. You probably need to tokenize the string on your own (i.e. a for-loop) or use a framework like http://www.antlr.org/
Upvotes: 2