Reputation: 14390

Does C++ string hashing hash the string or the memory address?

I've never studied hashing algorithms before, and I was surprised that when using std::unordered_map I found out that the hashing function (I think) actually hashes memory addresses, and not strings. Please correct me if I'm wrong, but I found this out by just changing a raw string and adding it to my unordered_map, and when the memory address (pointer) was the same it never added anything.

And in the following case whether a new key is added or not depends on whether std::string reallocates to another area of memory or not:

std::unordered_map<const char*, char*> myMap;

std::string myString = "Key1";

myMap[myString.c_str()] = "someVal";    // <--- Adds a new key, size is now 1
myString = "Key2";
myMap[myString.c_str()] = "someVal";    // <--- Doesn't add a new key "Key2" didn't need to be reallocated

However when using std::string directly in the template when I change the string it DOES add another key to my map, so that would indicate that the unordered_map template is specialised for std::string and actually hashes the string itself? Is this way slower if it has to hash the string itself?

The reason I bring this up is that the tutorials I've seen seem to convey the meaning that it's the actual string itself that gets hashed. Even here on Stack Overflow I've seen people say how (paraphrasing)"the entire string doesn't need to be checked, only as many characters as necessary" for performance reasons.

Well the impression that I got is obviously wrong for string literals and pointers to strings, but not for the std::string class?

Upvotes: 2

Answers (5)

Spock77

Reputation: 3325

Using pointers as keys could be a solution but only for CONSTANT strings - a pointer is the simplest and fastest hash. You can use different const variables to init unordered map, be sure that their lifetime is appropriate.

Upvotes: -2

Christian Hackl

Reputation: 27548

Does C++ string hashing hash the string or the memory address?

This question is really about equality versus identity and depends on what you mean when you say "string".

Equality. If you mean the std::string class, then hashing has nothing to do with memory addresses. The string's actual content is hashed. Two std::string instances are equal and produce the same hash if the contents are equal to each other.
Identity. If you mean a pointer to some characters in memory, then the memory address is hashed, regardless of what data is saved there. Two "strings" are identical and produce the same hash if they point to the same memory location.

When you deal with strings, you almost always want equality comparisons and are encouraged to use std::string, because two different string instances representing the same data should be considered equal even if the data lives at different memory addresses, and std::string always provides those semantics for you, be it with hashing or with simple comparisons like myStr1 == myStr2.^[*]

Hashing char const* or char* is very dangerous, because you run into a lot of implemented-defined behaviour. String literals are the prime example for this. For example, consider the following program:

#include <iostream>

int main()
{
    char const *a = "foo";
    char const *b = "foo";

    std::cout << reinterpret_cast<void const*>(a) << "\n";
    std::cout << reinterpret_cast<void const*>(b) << "\n";
}

The C++ standard does not tell you if the addresses will be identical or not. Compilers usually allow you to control this behaviour. For instance, Visual C++ has the /GF flag for it. If you turn it on, the addresses will be identical; otherwise, they won't.

This has quite drastic consequences for hashing. In the following program, it is implementation-defined whether 1 or 2 will be printed:

#include <iostream>
#include <unordered_map>

int main()
{
    char const *a = "foo";
    char const *b = "foo";

    std::unordered_map<char const*, char*> myMap;
    myMap[a] = "1";
    myMap[b] = "2";

    std::cout << myMap.size() << "\n"; // prints 1 or 2
}

Your code also has implemented-defined behaviour; not because of literals, but in a different way:

And in the following case whether a new key is added or not depends on whether std::string reallocates to another area of memory or not:

Yes. You should never take c_str() pointers from two different std::string instances and assume that the pointers will be identical only because the std::string instances are equal.

Is this way slower if it has to hash the string itself?

No. I challenge you to come up with a realistic use case for which you can actually measure the difference. Only then is it "way" slower. Otherwise, it's plain old premature optimisation.

But there's more to it. Technically, hashing a single address should be faster than using the entire string contents (or large portions of it) to compute a hash value, because more data is involved. That's quite obvious. But I'm not sure you see the necessity of performing the "expensive" computation. There is no magic involved in this. If your program logic cares about the contents of the strings, then the individual characters must be taken into account. How, even in theory, should you be able to hash data you do not read?

Or, more generally, how to hash something you don't have?

^{[*] Coincidentally, failure to consider this distinction is the same source for a very common bug in Java, i.e. str1 == str2 having different semantics than str1.equals(str2).}

Upvotes: 2

WhiZTiM

Reputation: 21576

If you look at the standard basic specializations for std::hash. There is no specialization for const char * because that is simply a pointer to a character array. However, there is a specialization for any pointer type:

template< class T > struct hash<T*>;

which is what is used by std::unordered_map. It simply hashes the address.

Simply using const char* as key to your std::unordered_map with default hash and equality is messy, because the default hash function hashes the address, and the default equality function will compare the address. You should prefer std::string for your key, else you need to do something like:

std::unordered_map<const char*, char*, MyCustomHash, MyCustomEquality> myMap;

Upvotes: 2

Walter

Reputation: 45484

You're mistaken to think that const char* is a string. It actually is a pointer. Hence std::unordered_map<const char*, anything> uses pointers (of type const char*) as key, and the specialization of std::hash for pointers (which hashes the address) as hash key.

If you want to use strings as keys, you should use std::string, e.g. std::unordered_map<std::string, anything>.

edit I should also say that using pointers instead of strings is at least dangerous but often impossible. It will not do what you think. The problem is that the string (character sequence) and its address (pointer) are not necessarily paired up for the lifetime of the program (though this may be true for some const char* objects). Think of the following

std::unordered_map<const char*,int> map;
char str[11] = "bad";
map[str] = 2;           // hashes str = char*
auto x = map["bad"];    // hashes address of "bad"; x!=2

This exemplifies that using the address as key does not work as intended: you cannot obtain an element from the character sequence ("bad")

Upvotes: 6

geipel

Reputation: 405

The code is behaving correctly, since the key is a const char*. Try using std::string as the key to get the behavior you're looking for.

So: std::unordered_map<std::string, char*> myMap;

Upvotes: 1

Does C++ string hashing hash the string or the memory address?

Answers (5)

Related Questions