Reputation: 7879
I'm doing some text analysis and need to record the frequencies of character transitions in a String. I have n categories of characters: for the sake of example, isUpperCase(), isNumber(), and isSpace().
Given that there are n categories, there will be n^2 categories of transitions, e.g. "isUpperCase() --> isUpperCase()", "isUpperCase() --> isLetter()", "isLetter() --> isUpperCase()", etc.
Given a block of text, I would like to record the number of transitions that took place. I would imagine constructing a Map with the transition types as the Keys and an Integer as each Value.
For the block of text "TO " (with a trailing space), the Map would look like [isUpper -> isUpper : 1, isUpper -> isSpace : 1].
The part I cannot figure out, though, is how to construct a Map where, from what I can see, the Key would consist of 2 boolean methods.
Upvotes: 2
Views: 305
Reputation: 61168
Create an enum that represents character types. You also need a way to get the character type enum for a given character; I'm sure there are better ways to do that than what I have done below, but that is left as an exercise for the reader.
Next, create a method that takes the types of the previous and current characters and concatenates them into a unique String.
Finally, loop over the input string and hey presto.
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;

// Wrapper class (the name is arbitrary) so the snippet compiles as-is.
public class TransitionCounter {

    private static enum CharacterType {
        UPPER {
            @Override
            boolean isA(final char c) {
                return Character.isUpperCase(c);
            }
        },
        LOWER {
            @Override
            boolean isA(final char c) {
                return Character.isLowerCase(c);
            }
        },
        SPACE {
            @Override
            boolean isA(final char c) {
                return Character.isWhitespace(c);
            }
        },
        UNKNOWN {
            @Override
            boolean isA(final char c) {
                return false;
            }
        };

        abstract boolean isA(final char c);

        // Returns the first type that matches c, or UNKNOWN if none does.
        public static CharacterType toType(final char c) {
            for (final CharacterType type : values()) {
                if (type.isA(c)) {
                    return type;
                }
            }
            return UNKNOWN;
        }
    }

    // Builds the map key, e.g. UPPER_TO_LOWER.
    private static String getTransitionType(final CharacterType prev, final CharacterType current) {
        return prev + "_TO_" + current;
    }

    public static void main(String[] args) {
        // The double space produces the SPACE_TO_SPACE transition.
        final String myString = "AAaaA  Aaa AA";
        // TreeMap keeps the keys sorted for printing.
        final Map<String, Integer> countMap = new TreeMap<>();
        final char[] myStringAsArray = myString.toCharArray();
        CharacterType prev = CharacterType.toType(myStringAsArray[0]);
        for (int i = 1; i < myStringAsArray.length; ++i) {
            final CharacterType current = CharacterType.toType(myStringAsArray[i]);
            // Stores 1 for a new key, otherwise adds 1 to the existing count.
            countMap.merge(getTransitionType(prev, current), 1, Integer::sum);
            prev = current;
        }
        for (final Entry<String, Integer> entry : countMap.entrySet()) {
            System.out.println(entry);
        }
    }
}
Output:
LOWER_TO_LOWER=2
LOWER_TO_SPACE=1
LOWER_TO_UPPER=1
SPACE_TO_SPACE=1
SPACE_TO_UPPER=2
UPPER_TO_LOWER=2
UPPER_TO_SPACE=1
UPPER_TO_UPPER=2
Running the method on the content of your question (825 chars) took 9ms.
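As a variation on the code above, the key can also be a small value object that holds the two categories, instead of a concatenated String. The following is only a sketch under that assumption: `TransitionKeyDemo`, `Kind`, and `Transition` are hypothetical names chosen for illustration, and the category checks are just examples.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class TransitionKeyDemo {

    // Hypothetical categories for illustration; the first matching check wins.
    enum Kind {
        UPPER, LOWER, SPACE, OTHER;

        static Kind of(char c) {
            if (Character.isUpperCase(c)) return UPPER;
            if (Character.isLowerCase(c)) return LOWER;
            if (Character.isWhitespace(c)) return SPACE;
            return OTHER;
        }
    }

    // Immutable pair of categories; equals/hashCode make it usable as a Map key.
    static final class Transition {
        final Kind from;
        final Kind to;

        Transition(Kind from, Kind to) {
            this.from = from;
            this.to = to;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Transition)) return false;
            Transition other = (Transition) o;
            return from == other.from && to == other.to;
        }

        @Override
        public int hashCode() {
            return Objects.hash(from, to);
        }

        @Override
        public String toString() {
            return from + " -> " + to;
        }
    }

    // Counts every adjacent pair of characters by the categories they fall into.
    static Map<Transition, Integer> count(String text) {
        Map<Transition, Integer> counts = new HashMap<>();
        for (int i = 1; i < text.length(); i++) {
            Transition t = new Transition(Kind.of(text.charAt(i - 1)), Kind.of(text.charAt(i)));
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // HashMap iteration order is unspecified, so the print order may vary.
        System.out.println(count("TO "));
    }
}
```

Because the loop starts at index 1, an empty or single-character input simply yields an empty map. On Java 16 or later, a record could replace the hand-written `Transition` class.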
Upvotes: 4
Reputation: 635
If you expect most of the transitions to occur, a two-dimensional array would work best:
int n = _categories.size();
int[][] _transitionFreq = new int[n][n];
If you expect the array to be sparse, a map will be more efficient in terms of memory usage, but less efficient in terms of performance.
It's a trade-off you'll have to make depending on your data and the number of character types.
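A minimal sketch of the array approach, assuming the categories are modelled as an enum so that `ordinal()` can serve as the array index. `TransitionMatrixDemo` and `Category` are hypothetical names, and the checks only approximate the categories mentioned in the question (`Character.isDigit` stands in for isNumber()).

```java
class TransitionMatrixDemo {

    // Hypothetical categories; ordinal() doubles as the array index.
    enum Category {
        UPPER, DIGIT, SPACE, OTHER;

        static Category of(char c) {
            if (Character.isUpperCase(c)) return UPPER;
            if (Character.isDigit(c)) return DIGIT;
            if (Character.isWhitespace(c)) return SPACE;
            return OTHER;
        }
    }

    // freq[i][j] counts transitions from category i to category j.
    static int[][] count(String text) {
        int n = Category.values().length;
        int[][] freq = new int[n][n];
        for (int i = 1; i < text.length(); i++) {
            int from = Category.of(text.charAt(i - 1)).ordinal();
            int to = Category.of(text.charAt(i)).ordinal();
            freq[from][to]++;
        }
        return freq;
    }

    public static void main(String[] args) {
        int[][] freq = count("TO 42");
        // Print only the transitions that actually occurred.
        for (Category from : Category.values()) {
            for (Category to : Category.values()) {
                int c = freq[from.ordinal()][to.ordinal()];
                if (c > 0) {
                    System.out.println(from + " -> " + to + " : " + c);
                }
            }
        }
    }
}
```

The array always holds n * n counters regardless of the input, which is the memory cost the trade-off above refers to.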
Upvotes: 0