Build Succeeded
Build Succeeded

Reputation: 1150

Can std::hash<std::string> return the same value for different strings?

Below link it is mentioned chances of collision but I am trying to use it for finding duplicate entry:

http://www.cplusplus.com/reference/functional/hash/

I am using std::hash<std::string> and storing the return value in std::unordered_set. if emplace is fails, I am marking string as it is duplicate string.

Upvotes: 4

Views: 3651

Answers (4)

Zan Lynx
Zan Lynx

Reputation: 54325

Yes it can return the same result for different strings. This is a natural consequence of reducing an infinite range of possibilities to a single 64-bit number.

There exist things called "perfect hash functions" which produce a hash function that will return unique results. However, this is only guaranteed for a known set of inputs. An unknown input from outside might produce a matching hash number. That possibility can be reduced by using a bloom filter.

However, at some point with all these hash calculations the program would have been better off doing simple string comparisons in an unsorted linear array. Who cares if the operation is O(1)+C if C is ridiculously big.

Upvotes: 3

ProXicT
ProXicT

Reputation: 1933

Yes, it is possible that two different strings will share the same hash. Simply put, let's imagine you have a hash stored in an 8bit type (unsigned char). That is 2^8 = 256 possible values. That means you can only have 256 unique hashes of arbitrary inputs.
Since you can definitely create more than 256 different strings, there is no way the hash would be unique for all possible strings.

std::size_t is a 64bit type, so if you used this as a storage for the hash value, you'd have 2^64 possible hashes, which is marginally more than 256 possible unique hashes, but it's still not enough to differentiate between all the possible strings you can create.

You just can't store an entire book in only 64 bits.

Upvotes: 3

einpoklum
einpoklum

Reputation: 131547

Hashes are generally functions from a large space of values into a small space of values, e.g. from the space of all strings to 64-bit integers. There are a lot more strings than 64-bit integers, so obviously multiple strings can have the same hash. A good hash function is such that there's no simple rule relating strings with the same hash value.

So, when we want to use hashes to find duplicate strings (or duplicate anything), it's always a two-phase process (at least):

  1. Look for strings with identical hash (i.e. locate the "hash bucket" for your string)
  2. Do a character-by-character comparison of your string with other strings having the same hash.

std::unordered_set does this - and never mind the specifics. Note that it does this for you, so it's redundant for you to hash yourself, then store the result in an std::unordered_set.

Finally, note that there are other features one could use for initial duplicate screening - or for searching among the same-hash values. For example, string length: Before comparing two strings character-by-character, you check their lengths (which you should be able to access without actually iterating the strings); different lengths -> non-equal strings.

Upvotes: 4

Build Succeeded
Build Succeeded

Reputation: 1150

Yes, std::hash return same result for different std::string. The creation of buckets is different by different compiler.

Compiler based implementation found at link: hashing and rehashing for std::unordered_set

Upvotes: 0

Related Questions