Reputation: 8663
Recently, I came across a piece of code, where Map<Integer, String>
is used, where Integer
(key) is hashCode
of some string and String
value corresponding to that.
Is this right thing to do? Because now, equals
will not be called for the String
when calling get
. (get
is also done using hashCode()
method on String object.
Or, hashCode(s) are unique for unique Strings?
I checked equals
od String
class. There is logic written for that. I am confused.
Upvotes: 10
Views: 9313
Reputation: 94058
This is not correct, as hashCode()
will have collisions. As hashCode()
is not well defined, it may have collisions for any pair of strings. So you cannot use that. If you require a unique pointer to a string then you can use a cryptographic hash as key:
The following code shows how to do this:
/**
* Immutable class that represents a unique key for a string. This unique key
* can be used as a key in a hash map, without the likelihood of a collision:
* generating the same key for a different String. Note that you should first
* check if keeping a key to a reference of a string is feasible, in that case a
* {@link Set} may suffice.
* <P>
* This class utilizes SHA-512 to generate the keys and uses
* {@link StandardCharsets#UTF_8} for the encoding of the strings. If a smaller
* output size than 512 bits (64 bytes) is required then the leftmost bytes of
* the SHA-512 hash are used. Smaller keys are therefore contained in with
* larger keys over the same String value.
* <P>
* Note that it is not impossible to create collisions for key sizes up to 8-20
* bytes.
*
* @author owlstead
*/
public final class UniqueKey implements Serializable {
public static final int MIN_DIGEST_SIZE_BYTES = 8;
public static final int MAX_DIGEST_SIZE_BYTES = 64;
/**
* Creates a unique key for a string with the maximum size of 64 bytes.
*
* @param input
* the input, not null
* @return the generated instance
*/
public static UniqueKey createUniqueKey(final CharSequence input) {
return doCreateUniqueKey(input, MAX_DIGEST_SIZE_BYTES);
}
/**
* Creates a unique key for a string with a size of 8 to 64 bytes.
*
* @param input
* the input, not null
* @param outputSizeBytes
* the output size
* @return the generated instance
*/
public static UniqueKey createUniqueKey(final CharSequence input,
final int outputSizeBytes) {
return doCreateUniqueKey(input, outputSizeBytes);
}
@Override
public boolean equals(final Object obj) {
if (!(obj instanceof UniqueKey)) {
return false;
}
final UniqueKey that = (UniqueKey) obj;
return ByteBuffer.wrap(this.key).equals(ByteBuffer.wrap(that.key));
}
@Override
public int hashCode() {
return ByteBuffer.wrap(this.key).hashCode();
}
/**
* Outputs an - in itself - unique String representation of this key.
*
* @return the string <CODE>"{key: [HEX ENCODED KEY]}"</CODE>
*/
@Override
public String toString() {
// non-optimal but readable conversion to hexadecimal
final StringBuilder sb = new StringBuilder(this.key.length * 2);
sb.append("{Key: ");
for (int i = 0; i < this.key.length; i++) {
sb.append(String.format("%02X", this.key[i]));
}
sb.append("}");
return sb.toString();
}
/**
* Makes it possible to retrieve the underlying key data (e.g. to use a
* different encoding).
*
* @return the data in a read only ByteBuffer
*/
public ByteBuffer asReadOnlyByteBuffer() {
return ByteBuffer.wrap(this.key).asReadOnlyBuffer();
}
private static final long serialVersionUID = 1L;
private static final int BUFFER_SIZE = 512;
// byte array instead of ByteBuffer to support serialization
private final byte[] key;
private static UniqueKey doCreateUniqueKey(final CharSequence input,
final int outputSizeBytes) {
// --- setup digest
final MessageDigest digestAlgorithm;
try {
// note: relatively fast on 64 bit systems (faster than SHA-256!)
digestAlgorithm = MessageDigest.getInstance("SHA-512");
} catch (final NoSuchAlgorithmException e) {
throw new IllegalStateException(
"SHA-256 should always be avialable in a Java RE");
}
// --- validate input parameters
if (outputSizeBytes < MIN_DIGEST_SIZE_BYTES
|| outputSizeBytes > MAX_DIGEST_SIZE_BYTES) {
throw new IllegalArgumentException(
"Unique key size either too small or too big");
}
// --- setup loop
final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
final CharBuffer buffer = CharBuffer.wrap(input);
final ByteBuffer encodedBuffer = ByteBuffer.allocate(BUFFER_SIZE);
CoderResult coderResult;
// --- loop over all characters
// (instead of encoding everything to byte[] at once - peak memory!)
while (buffer.hasRemaining()) {
coderResult = encoder.encode(buffer, encodedBuffer, false);
if (coderResult.isError()) {
throw new IllegalArgumentException(
"Invalid code point in input string");
}
encodedBuffer.flip();
digestAlgorithm.update(encodedBuffer);
encodedBuffer.clear();
}
coderResult = encoder.encode(buffer, encodedBuffer, true);
if (coderResult.isError()) {
throw new IllegalArgumentException(
"Invalid code point in input string");
}
encodedBuffer.flip();
digestAlgorithm.update(encodedBuffer);
// no need to clear encodedBuffer if generated locally
// --- resize result if required
final byte[] digest = digestAlgorithm.digest();
final byte[] result;
if (outputSizeBytes == digest.length) {
result = digest;
} else {
result = Arrays.copyOf(digest, outputSizeBytes);
}
// --- and return the final, possibly resized, result
return new UniqueKey(result);
}
private UniqueKey(final byte[] key) {
this.key = key;
}
}
Upvotes: 1
Reputation: 101918
By doing it's own hashing of the key String, that code risks the chance that two different key strings will generate the same integer map key and the code will fail in some situations.
In general, the code should probably be using Map<String,String>
.
However it is possible that the author is doing that for a good (ie deliberate) reason and it is not a bug. We can't tell without seeing the code.
Upvotes: 0
Reputation: 692131
A HashMap does use equals()
to compare keys. It only uses hashCode()
to find the bucket where the key is located, and thus drastically reduce the number of keys to compare with equals()
.
Obviously, hashCode()
can't produce unique values, since int is limited to 2^32 distinct values, and there are an infinity of possible String values.
In conclusion, the result of hashCode()
is not a good fit for a key of a Map
.
Upvotes: 19
Reputation: 28786
A hashCode is to increase the performance of collections. For example with an Set, which contains unique items. When adding a new item, the hashCode will be used to detect if the item already exists in the Set.
In the event of a collision, we'll fall through to a (usually) slower equals comparison.
Therefore, its important for performance that a hashCode should return a distinct value. Of much greater importance is that an object's hashCode is consistent across invocations, given the same internal state.
Therefore it is not wise use hashCode as keys in a map! They are not unique.
Upvotes: 0
Reputation: 136112
No, they are not unique, eg "FB" and "Ea" both have hashCode = 2236
Upvotes: 10