Lookup substring in map, case-insensitive

Question

I wrote a small lexer that transforms a charbuffer into a stream of tokens. One type of tokens are identifiers, which can either be "raw" or a keyword. To test for the latter, I have a map populated with all keywords.

Map lookup = new HashMap<>();
lookup.put("RETURN", KEYWORD_RETURN);
[...]

The map is populated with all uppercase strings.

Now, what I get from my input charbuffer is simply an offset and the length, where I can find my identifier (which needn't be in uppercase there).

The obvious solution looks somewhat like this.

bool lookupIdentifier(CharBuffer buffer, int offset, int length, Map lookupTable) {
    int current = buffer.position();
    buffer.rewind();
    String toCheck = buffer.subSequence(offset, offset + length).toString().toUpperCase();
    buffer.position(current);
    return lookupTable.containsKey(toCheck);
}

There are around 50 entries in the map. Is a TreeMap with a case-insensitive comparator a good alternative to the O(1) HashMap lookup?

What I dislike about my approach is that the creation of the toCheck string allocates. Is there a way to reuse the substring in the CharBuffer for lookup?

Holger · Accepted Answer

You can avoid expensive string construction by using a CharBuffer as key type:

Map lookup = new TreeMap<>(Comparator
    .comparingInt(CharBuffer::remaining)
    .thenComparing((cb1,cb2) -> {
        for(int p1 = cb1.position(), p2 = cb2.position(); p1 < cb1.limit(); p1++, p2++) {
            char c1 = cb1.get(p1), c2 = cb2.get(p2);
            if(c1 == c2) continue;
            c1 = Character.toUpperCase(c1);
            c2 = Character.toUpperCase(c2);
            if(c1 != c2) return Integer.compare(c1, c2);
        }
        return 0;
    }));
lookup.put(CharBuffer.wrap("RETURN"), MyType.KEYWORD_RETURN);

boolean lookupIdentifier(
    CharBuffer buffer, int offset, int length, Map lookupTable) {

    int currentPos = buffer.position(), currLimit = buffer.limit();
    buffer.clear().position(offset).limit(offset + length);
    boolean result = lookupTable.containsKey(buffer);
    buffer.clear().position(currentPos).limit(currLimit);
    return result;
}

The comparator uses a cheap length comparison before performing a case insensitive character comparison. This assumes that you stay with keywords like RETURN, which have a simple case mapping.

For a map with 50 keywords, using log₂ comparisons for a lookup might still yield reasonable performance. Mind that each comparison stops at the first mismatch.

You can use hashing with a dedicated wrapper object:

final class LookupKey {
    final CharBuffer cb;
    LookupKey(CharBuffer cb) {
        this.cb = cb;
    }
    @Override public int hashCode() {
        int code = 1;
        for(int p = cb.position(); p < cb.limit(); p++) {
            code = Character.toUpperCase(cb.get(p)) + code * 31;
        }
        return code;
    }
    @Override public boolean equals(Object obj) {
        if(!(obj instanceof LookupKey)) return false;
        final LookupKey other = (LookupKey)obj;
        CharBuffer cb1 = this.cb, cb2 = other.cb;
        if(cb1.remaining() != cb2.remaining()) return false;
        for(int p1 = cb1.position(), p2 = cb2.position(); p1 < cb1.limit(); p1++, p2++) {
            char c1 = cb1.get(p1), c2 = cb2.get(p2);
            if(c1 == c2) continue;
            c1 = Character.toUpperCase(c1);
            c2 = Character.toUpperCase(c2);
            if(c1 != c2) return false;
        }
        return true;
    }
}

Map lookup = new HashMap<>();
lookup.put(new LookupKey(CharBuffer.wrap("RETURN")), MyType.KEYWORD_RETURN);

boolean lookupIdentifier(
    CharBuffer buffer, int offset, int length, Map lookupTable) {

    int currentPos = buffer.position(), currLimit = buffer.limit();
    buffer.clear().position(offset).limit(offset + length);
    boolean result = lookupTable.containsKey(new LookupKey(buffer));
    buffer.clear().position(currentPos).limit(currLimit);
    return result;
}

The construction of a lightweight object like the LookupKey, which unlike String construction doesn’t need to copy character contents, is negligible. But note that hashing, unlike a comparator, has to process all characters upfront, which can turn out to be more expensive than the log₂ comparisons of a small TreeMap.

If these keywords are unlikely to change, an explicit lookup code, i.e. a switch over invariant properties of the key strings, can be even more efficient. E.g. start with switching over the length, if most keywords differ in length, then over a character which is different for most keywords (include case labels for uppercase and lowercase variant). Another alternative is a hierarchical lookup structure for these properties.

Lookup substring in map, case-insensitive

Answers (1)

Related Questions