gexicide
gexicide

Reputation: 40048

How to implement efficient hash cons with java HashSet

I am trying to implement a hash cons in java, comparable to what String.intern does for strings. I.e., I want a class to store all distinct values of a data type T in a set and provide an T intern(T t) method that checks whether t is already in the set. If so, the instance in the set is returned, otherwise t is added to the set and returned. The reason is that the resulting values can be compared using reference equality since two equal values returned from intern will for sure also be the same instance.

Of course, the most obvious candidate data structure for a hash cons is java.util.HashSet<T>. However, it seems that its interface is flawed and does not allow efficient insertion, because there is no method to retrieve an element that is already in the set or insert one if it is not in there.

An algorithm using HashSet would look like this:

class HashCons<T>{
    HashSet<T> set = new HashSet<>();

    public T intern(T t){
        if(set.contains(t)) {
           return ???;  // <----- PROBLEM
        } else {
           set.add(t); // <--- Inefficient, second hash lookup
           return t;
    }
}

As you see, the problem is twofold:

  1. This solution would be inefficient since I would access the hash table twice, once for contains and once for add. But okay, this may not be a too big performance hit since the correct bucket will be in the cache after the contains, so add will not trigger a cache miss and thus be quite fast.
  2. I cannot retrieve an element already in the set (see line flagged PROBLEM). There is just no method to retrieve the element in the set. So it is just not possible to implement this.

Am I missing something here? Or is it really impossible to build a usual hash cons with java.util.HashSet?

Upvotes: 1

Views: 299

Answers (2)

leventov
leventov

Reputation: 15263

Well HashSet is implemented as HashMap wrapper in OpenJDK, so you won't win in memory usage comparing to solution suggested by aRestless.

10-min sketch

class HashCons<T> {
    T[] table;
    int size;
    int sizeLimit;
    HashCons(int expectedSize) {
        init(Math.max(Integer.highestOneBit(expectedSize * 2) * 2, 16));
    }

    private void init(int capacity) {
        table = (T[]) new Object[capacity];
        size = 0;
        sizeLimit = (int) (capacity * 2L / 3);
    }

    T cons(@Nonnull T key) {
        int mask = table.length - 1;
        int i = key.hashCode() & mask;
        do {
            if (table[i] == null) break;
            if (key.equals(table[i])) return table[i];
            i = (i + 1) & mask;
        } while (true);
        table[i] = key;
        if (++size > sizeLimit) rehash();
        return key;
    }

    private void rehash() {
        T[] table = this.table;
        if (table.length == (1 << 30))
            throw new IllegalStateException("HashCons is full");
        init(table.length << 1);
        for (T key : table) {
            if (key != null) cons(key);
        }
    }
}

Upvotes: 1

aRestless
aRestless

Reputation: 1885

I don't think it's possible using HashSet. You could use some kind of Map instead and use your value as key and as value. The java.util.concurrent.ConcurrentMap also happens to posess the quite convenient method

putIfAbsent(K key, V value)

that returns the value if it is already existent. However, I don't know about the performance of this method (compared to checking "manually" on non-concurrent implementations of Map).

Here is how you would do it using a HashMap:

class HashCons<T>{
    Map<T,T> map = new HashMap<T,T>();

    public T intern(T t){
        if (!map.containsKey(t))
            map.put(t,t);
        return map.get(t);
    }
}

I think the reason why it is not possible with HashSet is quite simple: To the set, if contains(t) is fulfilled, it means that the given t also equals one of the t' in the set. There is no reason for being able return it (as you already have it).

Upvotes: 1

Related Questions