matthias

Reputation: 2062

Tokenize a String in Java

I have a String

a-b-c

I want to tokenize the string on the character '-', so the result would be

[a, b, c]

But then I have a String

a---c

The result should be

[a, -, c]

Is there already a tokenizer in Java which can do this?

Upvotes: 0

Views: 393

Answers (3)

wumpz

Reputation: 9201

This is a solution using only regexps to give you the needed result for your test data:

\b-|-\b

Word boundaries (\b) are often underestimated, but they can simplify many regexps dramatically.

With the provided regexp you can now use Java's split method. A small test class could look like this:

import java.util.Arrays;

public class SimpleRegExp {
    public static void main(String[] args) {
        // Split on a hyphen that has a word boundary on either side.
        String regexp = "\\b-|-\\b";
        System.out.println(Arrays.toString("a-b-c".split(regexp)));
        System.out.println(Arrays.toString("a---c".split(regexp)));
    }
}

and prints this result:

[a, b, c]
[a, -, c]
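
To see why this works, it can help to list where the expression actually matches. The following is only a minimal sketch: the class name BoundaryMatches and the helper matchStarts are made up for illustration, and the inputs "ab---cd" and "$---%" are extra samples that do not come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class BoundaryMatches {
    // Same expression as above: a hyphen with a word boundary on either side.
    static final Pattern HYPHEN_AT_BOUNDARY = Pattern.compile("\\b-|-\\b");

    // Returns the start index of every match in the input.
    static List<Integer> matchStarts(String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = HYPHEN_AT_BOUNDARY.matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Only the hyphens touching a word character match; the middle one is kept.
        System.out.println(matchStarts("ab---cd")); // [2, 4]
        // Without word characters there is no word boundary, so nothing matches.
        System.out.println(matchStarts("$---%"));   // []
    }
}
```

The second call shows the assumption behind the expression: the tokens have to consist of word characters (letters, digits, underscore), otherwise split returns the input unchanged.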

Upvotes: 1

Duncan Jones

Reputation: 69399

I'm going to assume that your delimiter is always a single hyphen, that --- would split to [-, -], and that ---- would either be invalid or also split to [-, -]. In that case, the following would work for you:

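// Requires: import java.util.ArrayList; import java.util.List;
private static List<String> tokenize(String input, char delimiter) {
    List<String> result = new ArrayList<>();
    StringBuilder builder = new StringBuilder();

    for (char c : input.toCharArray()) {
        if (builder.length() == 0) {
            // The first character of a token is always kept, even if it is the delimiter.
            builder.append(c);
        } else if (c == delimiter) {
            // A delimiter after at least one character ends the current token.
            result.add(builder.toString());
            builder.setLength(0);
        } else {
            builder.append(c);
        }
    }

    // Flush the last token, if any.
    if (builder.length() > 0) {
        result.add(builder.toString());
    }

    return result;
}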

Test code:

public static void main(String[] args) {
    String s1 = "a-b-c";
    String s2 = "a---c";

    // List.toString() already prints the elements as [x, y, z].
    System.out.println(tokenize(s1, '-'));
    System.out.println(tokenize(s2, '-'));
}

Prints:

[a, b, c]
[a, -, c]
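
The assumptions about longer hyphen runs can be checked the same way. This is a sketch that repeats the tokenize method so that it compiles on its own; the wrapper class name TokenizeEdgeCases is made up, and "a--c" is an extra input that is not covered by the assumptions above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper class so the method can be run on its own.
class TokenizeEdgeCases {
    static List<String> tokenize(String input, char delimiter) {
        List<String> result = new ArrayList<>();
        StringBuilder builder = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (builder.length() == 0) {
                // First character of a token is always kept, even a delimiter.
                builder.append(c);
            } else if (c == delimiter) {
                // A delimiter after at least one character ends the token.
                result.add(builder.toString());
                builder.setLength(0);
            } else {
                builder.append(c);
            }
        }
        if (builder.length() > 0) {
            result.add(builder.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("---", '-'));  // [-, -]
        System.out.println(tokenize("----", '-')); // [-, -]
        System.out.println(tokenize("a--c", '-')); // [a, -c]
    }
}
```

The last call shows a case to be aware of: with a run of two hyphens, the second hyphen starts the next token and the following character is appended to it.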

Upvotes: 0

laune

Reputation: 31300

This (first try) appears to handle your samples as requested.

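// Split on a hyphen between two \w characters, or between a \w character and another hyphen.
String rex = "(?<=-)-(?=\\w)|(?<=\\w)-(?=-)|(?<=\\w)-(?=\\w)";
String[] t1 = s1.split(rex);  // s1 is the input string, e.g. "a---c"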

Is \w a correct assumption for the characters other than '-'? If not, the expression should be changed.

Also, I think it can be condensed somewhat.
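
One possible condensation, as a sketch rather than a drop-in replacement: require a non-hyphen character on at least one side of the hyphen. Using [^-] instead of \w is an assumption that any character other than '-' may appear in a token. The class name CondensedSplit and the sample "$---%" are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical class, for illustration only.
class CondensedSplit {
    // A hyphen preceded by a non-hyphen, or followed by a non-hyphen.
    static final String REX = "(?<=[^-])-|-(?=[^-])";

    static String[] tokenize(String input) {
        return input.split(REX);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("a-b-c"))); // [a, b, c]
        System.out.println(Arrays.toString(tokenize("a---c"))); // [a, -, c]
        System.out.println(Arrays.toString(tokenize("$---%"))); // [$, -, %]
    }
}
```

This is not strictly equivalent to the longer expression: a hyphen at the very start or end of the input matches here but not there, so it only fits if such inputs do not occur.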

Upvotes: 0
