Adam_G

Reputation: 7879

Efficiently Compare Successive Characters in String

I'm doing some text analysis, and need to record the frequencies of character transitions in a String. I have n categories of characters: for the sake of example, isUpperCase(), isNumber(), and isSpace().

Given that there are n categories, there will be n^2 types of transition, e.g. "isUpperCase() --> isUpperCase()", "isUpperCase() --> isSpace()", "isSpace() --> isUpperCase()", etc.

Given a block of text, I would like to record the number of transitions that took place. I would imagine constructing a Map with the transition types as the Keys, and an Integer as each Value.

For the block of text "TO " (note the trailing space), the Map would look like [isUpper -> isUpper : 1, isUpper -> isSpace : 1]

The part I cannot figure out, though, is how to construct a Map where, from what I can see, the Key would consist of 2 boolean methods.

Upvotes: 2

Views: 305

Answers (2)

Boris the Spider

Reputation: 61168

Create an enum that represents the character types; you need a way to get the enum constant for a given character. I'm sure there are better ways to do that than what I have done below, but that is left as an exercise for the reader.

Next create a method that takes the previous and current characters and concatenates their types into a unique String.

Finally loop over the input string and hey presto.

// Imports needed by the snippet below; the enum and the methods are members of an enclosing class.
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;

private static enum CharacterType {

    UPPER {
        @Override
        boolean isA(final char c) {
            return Character.isUpperCase(c);
        }
    },
    LOWER {
        @Override
        boolean isA(final char c) {
            return Character.isLowerCase(c);
        }
    },
    SPACE {
        @Override
        boolean isA(final char c) {
            return Character.isWhitespace(c);
        }
    },
    UNKNOWN {
        @Override
        boolean isA(final char c) {
            return false;
        }
    };

    abstract boolean isA(final char c);

    // Returns the first type whose predicate matches c, or UNKNOWN if none does.
    public static CharacterType toType(final char c) {
        for (CharacterType type : values()) {
            if (type.isA(c)) {
                return type;
            }
        }
        return UNKNOWN;
    }
}

private static String getTransitionType(final CharacterType prev, final CharacterType current) {
    return prev + "_TO_" + current;
}

public static void main(String[] args) {
    final String myString = "AAaaA Aaa  AA";
    final Map<String, Integer> countMap = new TreeMap<String, Integer>() {
        @Override
        public Integer put(final String key, final Integer value) {
            final Integer currentCount = get(key);
            if (currentCount == null) {
                return super.put(key, value);
            }
            return super.put(key, currentCount + value);
        }
    };
    final char[] myStringAsArray = myString.toCharArray();
    CharacterType prev = CharacterType.toType(myStringAsArray[0]);
    for (int i = 1; i < myStringAsArray.length; ++i) {
        final CharacterType current = CharacterType.toType(myStringAsArray[i]);
        countMap.put(getTransitionType(prev, current), 1);
        prev = current;
    }
    for (final Entry<String, Integer> entry : countMap.entrySet()) {
        System.out.println(entry);
    }
}

Output:

LOWER_TO_LOWER=2
LOWER_TO_SPACE=1
LOWER_TO_UPPER=1
SPACE_TO_SPACE=1
SPACE_TO_UPPER=2
UPPER_TO_LOWER=2
UPPER_TO_SPACE=1
UPPER_TO_UPPER=2

Running the method on the content of your question (825 chars) took 9ms.
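
As a variation on the counting step, and assuming Java 8 or later, `Map.merge` can replace the overridden `put`. This is only a minimal sketch: the names `TransitionCounter`, `Kind` and `kindOf` are made up for illustration and simply mirror the classification used above.

```java
import java.util.Map;
import java.util.TreeMap;

class TransitionCounter {

    // Hypothetical, simplified stand-in for the character-type enum above.
    enum Kind { UPPER, LOWER, SPACE, UNKNOWN }

    // Classify a single character; anything unmatched falls through to UNKNOWN.
    static Kind kindOf(char c) {
        if (Character.isUpperCase(c)) return Kind.UPPER;
        if (Character.isLowerCase(c)) return Kind.LOWER;
        if (Character.isWhitespace(c)) return Kind.SPACE;
        return Kind.UNKNOWN;
    }

    // Count transitions between successive characters.
    // An empty or one-character string has no transitions and yields an empty map.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 1; i < text.length(); i++) {
            String key = kindOf(text.charAt(i - 1)) + "_TO_" + kindOf(text.charAt(i));
            // merge inserts 1 for a new key, otherwise adds 1 to the existing count
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> entry : count("AAaaA Aaa  AA").entrySet()) {
            System.out.println(entry);
        }
    }
}
```

Starting the loop at index 1 and reading `charAt(i - 1)` also avoids touching index 0 of an empty string, which the array-based loop above would do.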

Upvotes: 4

scharette

Reputation: 635

If you expect most of the transitions to be present, then a two-dimensional array would work best:

int n = _categories.size();
int[][] _transitionFreq = new int[n][n];

If you expect the array to be sparse, then a map will be more efficient in terms of memory usage, but slower to update and look up.

It's a trade-off you'll have to make depending on your data and the number of character types.
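
One way to fill such an array, sketched under the assumption that the categories are modelled as an enum so that `ordinal()` can serve as the index. The names `TransitionMatrix`, `Kind` and `kindOf` are hypothetical and not taken from the question.

```java
class TransitionMatrix {

    // Hypothetical categories; the declaration order defines the array indices.
    enum Kind { UPPER, LOWER, SPACE, OTHER }

    // Classify a single character; anything unmatched is OTHER.
    static Kind kindOf(char c) {
        if (Character.isUpperCase(c)) return Kind.UPPER;
        if (Character.isLowerCase(c)) return Kind.LOWER;
        if (Character.isWhitespace(c)) return Kind.SPACE;
        return Kind.OTHER;
    }

    // freq[from][to] holds how often category 'from' was directly followed by 'to'.
    static int[][] count(String text) {
        int n = Kind.values().length;
        int[][] freq = new int[n][n];
        for (int i = 1; i < text.length(); i++) {
            int from = kindOf(text.charAt(i - 1)).ordinal();
            int to = kindOf(text.charAt(i)).ordinal();
            freq[from][to]++;
        }
        return freq;
    }
}
```

With this layout, `freq[Kind.UPPER.ordinal()][Kind.SPACE.ordinal()]` gives the number of upper-to-space transitions. Relying on `ordinal()` ties the array layout to the declaration order of the enum, so it is probably best kept internal to one class.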

Upvotes: 0
