Reputation: 422
I wrote a small lexer that transforms a charbuffer into a stream of tokens. One type of tokens are identifiers, which can either be "raw" or a keyword. To test for the latter, I have a map populated with all keywords.
Map<String, MyType> lookup = new HashMap<>();
lookup.put("RETURN", KEYWORD_RETURN);
[...]
The map is populated with all uppercase strings.
Now, what I get from my input charbuffer is simply an offset and the length, where I can find my identifier (which needn't be in uppercase there).
The obvious solution looks somewhat like this.
bool lookupIdentifier(CharBuffer buffer, int offset, int length, Map<String, MyType> lookupTable) {
int current = buffer.position();
buffer.rewind();
String toCheck = buffer.subSequence(offset, offset + length).toString().toUpperCase();
buffer.position(current);
return lookupTable.containsKey(toCheck);
}
There are around 50 entries in the map. Is a TreeMap
with a case-insensitive comparator a good alternative to the O(1) HashMap
lookup?
What I dislike about my approach is that the creation of the toCheck
string allocates. Is there a way to reuse the substring in the CharBuffer
for lookup?
Upvotes: 1
Views: 155
Reputation: 298143
You can avoid expensive string construction by using a CharBuffer
as key type:
Map<CharBuffer, MyType> lookup = new TreeMap<>(Comparator
.comparingInt(CharBuffer::remaining)
.thenComparing((cb1,cb2) -> {
for(int p1 = cb1.position(), p2 = cb2.position(); p1 < cb1.limit(); p1++, p2++) {
char c1 = cb1.get(p1), c2 = cb2.get(p2);
if(c1 == c2) continue;
c1 = Character.toUpperCase(c1);
c2 = Character.toUpperCase(c2);
if(c1 != c2) return Integer.compare(c1, c2);
}
return 0;
}));
lookup.put(CharBuffer.wrap("RETURN"), MyType.KEYWORD_RETURN);
boolean lookupIdentifier(
CharBuffer buffer, int offset, int length, Map<CharBuffer, MyType> lookupTable) {
int currentPos = buffer.position(), currLimit = buffer.limit();
buffer.clear().position(offset).limit(offset + length);
boolean result = lookupTable.containsKey(buffer);
buffer.clear().position(currentPos).limit(currLimit);
return result;
}
The comparator uses a cheap length comparison before performing a case insensitive character comparison. This assumes that you stay with keywords like RETURN
, which have a simple case mapping.
For a map with 50 keywords, using log₂ comparisons for a lookup might still yield reasonable performance. Mind that each comparison stops at the first mismatch.
You can use hashing with a dedicated wrapper object:
final class LookupKey {
final CharBuffer cb;
LookupKey(CharBuffer cb) {
this.cb = cb;
}
@Override public int hashCode() {
int code = 1;
for(int p = cb.position(); p < cb.limit(); p++) {
code = Character.toUpperCase(cb.get(p)) + code * 31;
}
return code;
}
@Override public boolean equals(Object obj) {
if(!(obj instanceof LookupKey)) return false;
final LookupKey other = (LookupKey)obj;
CharBuffer cb1 = this.cb, cb2 = other.cb;
if(cb1.remaining() != cb2.remaining()) return false;
for(int p1 = cb1.position(), p2 = cb2.position(); p1 < cb1.limit(); p1++, p2++) {
char c1 = cb1.get(p1), c2 = cb2.get(p2);
if(c1 == c2) continue;
c1 = Character.toUpperCase(c1);
c2 = Character.toUpperCase(c2);
if(c1 != c2) return false;
}
return true;
}
}
Map<LookupKey, MyType> lookup = new HashMap<>();
lookup.put(new LookupKey(CharBuffer.wrap("RETURN")), MyType.KEYWORD_RETURN);
boolean lookupIdentifier(
CharBuffer buffer, int offset, int length, Map<LookupKey, MyType> lookupTable) {
int currentPos = buffer.position(), currLimit = buffer.limit();
buffer.clear().position(offset).limit(offset + length);
boolean result = lookupTable.containsKey(new LookupKey(buffer));
buffer.clear().position(currentPos).limit(currLimit);
return result;
}
The construction of a lightweight object like the LookupKey
, which unlike String
construction doesn’t need to copy character contents, is negligible. But note that hashing, unlike a comparator, has to process all characters upfront, which can turn out to be more expensive than the log₂ comparisons of a small TreeMap
.
If these keywords are unlikely to change, an explicit lookup code, i.e. a switch
over invariant properties of the key strings, can be even more efficient. E.g. start with switching over the length
, if most keywords differ in length, then over a character which is different for most keywords (include case
labels for uppercase and lowercase variant). Another alternative is a hierarchical lookup structure for these properties.
Upvotes: 5