digiarnie

Reputation: 23345

Finding tokens in a Java String

Is there a nice way to extract tokens that start with a pre-defined string and end with a pre-defined string?

For example, let's say the starting string is "[" and the ending string is "]". If I have the following string:

"hello[world]this[[is]me"

The output should be:

token[0] = "world"

token[1] = "[is"

(Note: the second token has a 'start' string in it)

Upvotes: 3

Views: 10095

Answers (8)

Babak Naffas

Reputation: 12561

The regular expression \\[[\\[\\w]+\\] (written as a Java string literal) matches [world] and [[is], brackets included.
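A minimal Java sketch of how this expression could be applied, assuming a capturing group is added around the character class so the outer brackets are dropped. The group and the class name BracketRegexDemo are illustrative additions, not part of the original expression. Note that \w only covers letters, digits and underscore, so tokens containing spaces or punctuation would not match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketRegexDemo {
    // The expression from this answer, plus a capturing group (added for this sketch)
    private static final Pattern TOKEN = Pattern.compile("\\[([\\[\\w]+)\\]");

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group(1)); // text between the outer brackets
        }
        return result;
    }
}
```

Calling BracketRegexDemo.tokens("hello[world]this[[is]me") should give the two tokens from the question, "world" and "[is".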

Upvotes: 0

glmxndr

Reputation: 46566

Here is how I would do it to avoid a dependency on Commons Lang.

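import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Escapes regex metacharacters so that a delimiter is matched literally.
public static String escapeRegexp(String regexp){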
    String specChars = "\\$.*+?|()[]{}^";
    String result = regexp;
    for (int i=0;i<specChars.length();i++){
        Character curChar = specChars.charAt(i);
        result = result.replaceAll(
            "\\"+curChar,
            "\\\\" + (i<2?"\\":"") + curChar); // \ and $ must have special treatment
    }
    return result;
}

public static List<String> findGroup(String content, String pattern, int group) {
    Pattern p = Pattern.compile(pattern);
    Matcher m = p.matcher(content);
    List<String> result = new ArrayList<String>();
    while (m.find()) {
        result.add(m.group(group));
    }
    return result;
}


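public static List<String> tokenize(String content, String firstToken, String lastToken){
    // Multi-character end token: lazy match up to its first occurrence.
    // Single-character end token: negated character class; the character is
    // escaped so that "]" or "\" cannot break the class.
    String regexp = lastToken.length()>1
                    ?escapeRegexp(firstToken) + "(.*?)"+ escapeRegexp(lastToken)
                    :escapeRegexp(firstToken) + "([^"+escapeRegexp(lastToken)+"]*)"+ escapeRegexp(lastToken);
    return findGroup(content, regexp, 1);
}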

Use it like this :

String content = "hello[world]this[[is]me";
List<String> tokens = tokenize(content,"[","]");

Upvotes: 3

Jonathan Holloway

Reputation: 63672

I think you can use the Apache Commons Lang feature that exists in StringUtils:

substringsBetween(java.lang.String str,
                  java.lang.String open,
                  java.lang.String close)

The API docs describe it as follows:

Searches a String for substrings delimited by a start and end tag, returning all matching substrings in an array.

The Commons Lang substringsBetween API can be found here:

http://commons.apache.org/lang/apidocs/org/apache/commons/lang/StringUtils.html#substringsBetween(java.lang.String,%20java.lang.String,%20java.lang.String)

Upvotes: 9

L. Cornelius Dol

Reputation: 64026

StringTokenizer won't cut it for the specified behavior. You'll need your own method. Something like:

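import java.util.ArrayList;
import java.util.List;

public List<String> extractTokens(String txt, String str, String end) {
    int                      so=0,eo;   // so: scan/start offset, eo: end offset
    List<String>             lst=new ArrayList<String>();

    while(so<txt.length() && (so=txt.indexOf(str,so))!=-1) {
        so+=str.length();                    // skip past the start string
        if(so<txt.length() && (eo=txt.indexOf(end,so))!=-1) {
            lst.add(txt.substring(so,eo));   // text between start and end
            so=eo+end.length();              // resume after the end string
            }
        }
    return lst;
    }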

Upvotes: 0

ahawker

Reputation: 3364

Try a regular expression like:

(.*?\[(.*?)\])

The second capture should contain all of the information between the set of []. This will however not work properly if the string contains nested [].
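A short Java sketch of how this expression might be used, reading the second capture group on each match. The class name LazyBracketDemo and the helper method are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyBracketDemo {
    // The expression from this answer, escaped for a Java string literal
    private static final Pattern TOKEN = Pattern.compile("(.*?\\[(.*?)\\])");

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group(2)); // group 2: text between '[' and the next ']'
        }
        return result;
    }
}
```

On the question's input, LazyBracketDemo.tokens("hello[world]this[[is]me") should yield "world" and "[is", because the lazy inner group stops at the first closing bracket after each opening one.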

Upvotes: 0

mnuzzo

Reputation: 3577

There's one way you can do this, though it isn't particularly pretty: go through the string character by character. When you reach a "[", you start putting the characters into a new token. When you reach a "]", you stop. This is best done with a growable data structure such as a list rather than an array, since arrays have a fixed length.
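The character-by-character scan could be sketched as follows. This assumes single-character delimiters, and the class name CharScanTokenizer and its method are hypothetical names for illustration. Tokens go into an ArrayList so that no fixed-length array is needed.

```java
import java.util.ArrayList;
import java.util.List;

class CharScanTokenizer {
    static List<String> extract(String text, char open, char close) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = null; // null means "not inside a token"
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (current == null) {
                if (c == open) {
                    current = new StringBuilder(); // start a new token
                }
            } else if (c == close) {
                tokens.add(current.toString());    // token finished
                current = null;
            } else {
                current.append(c); // includes any further 'open' characters
            }
        }
        return tokens; // an unterminated token at the end is dropped
    }
}
```

With the question's input, CharScanTokenizer.extract("hello[world]this[[is]me", '[', ']') should return "world" and "[is", since a second "[" seen while already inside a token is kept as ordinary text.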

Another possible solution is to pass a regex to String's split method. The only problem I have is coming up with a regex that would split the way you want it to. What I can come up with is (]string of characters[) XOR (string of characters[) XOR (]string of characters), where each set of parentheses denotes a different regex. You should evaluate them in this order so you don't accidentally remove anything you want. I'm not familiar with regexes in Java, so I used "string of characters" to denote that there are characters between the brackets.

Upvotes: 0

Rahul Garg

Reputation: 8610

A plain StringTokenizer won't work for this requirement; you will have to tweak it or write your own.

Upvotes: 0

Charlie Martin

Reputation: 112366

StringTokenizer? Set the search string to "[]" and the "include tokens" flag to false and I think you're set.
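A quick sketch of that call, assuming the "include tokens" flag refers to the returnDelims argument of the StringTokenizer constructor. The class name StringTokenizerDemo is a hypothetical name for illustration. Note that "[]" is treated as a set of delimiter characters, so on the question's input this returns every run of text between brackets (hello, world, this, is, me); the bracketed tokens would still need to be picked out separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class StringTokenizerDemo {
    static List<String> split(String input) {
        // "[]" is a set of delimiter characters; false means the
        // delimiters themselves are not returned as tokens
        StringTokenizer st = new StringTokenizer(input, "[]", false);
        List<String> parts = new ArrayList<>();
        while (st.hasMoreTokens()) {
            parts.add(st.nextToken());
        }
        return parts;
    }
}
```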

Upvotes: 0
