Reputation: 40048
I am trying to implement a hash cons in java, comparable to what String.intern
does for strings. I.e., I want a class to store all distinct values of a data type T
in a set and provide an T intern(T t)
method that checks whether t
is already in the set. If so, the instance in the set is returned, otherwise t
is added to the set and returned. The reason is that the resulting values can be compared using reference equality since two equal values returned from intern
will for sure also be the same instance.
Of course, the most obvious candidate data structure for a hash cons is java.util.HashSet<T>
. However, it seems that its interface is flawed and does not allow efficient insertion, because there is no method to retrieve an element that is already in the set or insert one if it is not in there.
An algorithm using HashSet
would look like this:
class HashCons<T>{
HashSet<T> set = new HashSet<>();
public T intern(T t){
if(set.contains(t)) {
return ???; // <----- PROBLEM
} else {
set.add(t); // <--- Inefficient, second hash lookup
return t;
}
}
As you see, the problem is twofold:
contains
and once for add
. But okay, this may not be a too big performance hit since the correct bucket will be in the cache after the contains
, so add
will not trigger a cache miss and thus be quite fast.PROBLEM
). There is just no method to retrieve the element in the set. So it is just not possible to implement this.Am I missing something here? Or is it really impossible to build a usual hash cons with java.util.HashSet
?
Upvotes: 1
Views: 299
Reputation: 15263
Well HashSet
is implemented as HashMap
wrapper in OpenJDK, so you won't win in memory usage comparing to solution suggested by aRestless.
10-min sketch
class HashCons<T> {
T[] table;
int size;
int sizeLimit;
HashCons(int expectedSize) {
init(Math.max(Integer.highestOneBit(expectedSize * 2) * 2, 16));
}
private void init(int capacity) {
table = (T[]) new Object[capacity];
size = 0;
sizeLimit = (int) (capacity * 2L / 3);
}
T cons(@Nonnull T key) {
int mask = table.length - 1;
int i = key.hashCode() & mask;
do {
if (table[i] == null) break;
if (key.equals(table[i])) return table[i];
i = (i + 1) & mask;
} while (true);
table[i] = key;
if (++size > sizeLimit) rehash();
return key;
}
private void rehash() {
T[] table = this.table;
if (table.length == (1 << 30))
throw new IllegalStateException("HashCons is full");
init(table.length << 1);
for (T key : table) {
if (key != null) cons(key);
}
}
}
Upvotes: 1
Reputation: 1885
I don't think it's possible using HashSet
. You could use some kind of Map
instead and use your value as key and as value. The java.util.concurrent.ConcurrentMap
also happens to posess the quite convenient method
putIfAbsent(K key, V value)
that returns the value if it is already existent. However, I don't know about the performance of this method (compared to checking "manually" on non-concurrent implementations of Map
).
Here is how you would do it using a HashMap
:
class HashCons<T>{
Map<T,T> map = new HashMap<T,T>();
public T intern(T t){
if (!map.containsKey(t))
map.put(t,t);
return map.get(t);
}
}
I think the reason why it is not possible with HashSet
is quite simple: To the set, if contains(t)
is fulfilled, it means that the given t
also equals one of the t' in the set. There is no reason for being able return it (as you already have it).
Upvotes: 1