Klaim
Klaim

Reputation: 69672

Most efficient associative container with std::string as key?

I've read somewhere that std::map is, with current compilers, still the most efficient associative container we have in the STL, even with std::unsorted_map that --from what I read somewhere, I'm not sure where-- becomes more efficient on find() only if there is a lot of entries, like more than 40k.

So now I'm not really sure anymore because I always assumed that a hash map is more efficient at least in case of string keys.

So to be short:

If I have to choose an associative container with unknown entry count and with std::string as keys, what would be (at least in theory) the more efficient (on speed) choice for finding?

Upvotes: 2

Views: 1819

Answers (2)

ex0du5
ex0du5

Reputation: 2624

When you have 40k entries or more, strings (or lists of elements, etc.) should not be used as associative keys in the standard containers. Instead, there comes a point much earlier where a trie or a ternary tree become better options. Both of those can build associative structures that only compare each character of your string (or element of your list, etc.) once. Ordered maps compare at every node (and so are O(m log n) - m size of string, n number of elements), and the unordered maps suffer from far more collisions at those sizes.

A ternary tree (each child branches to characters less, equal, or greater on a single char compare) takes the least memory of the better implementations, but tries are by far the fastest. Both of these may be built from boost.graph or some other generic graph library.

Upvotes: 1

Kerrek SB
Kerrek SB

Reputation: 476970

Profile, profile, profile...

The problem with strings as keys is that comparing them is very slow (think difference in the last character of a 1000-character string). The advantage of an unordered_map with a string key comes at least in part from the fact that only the fixed-width hash values have to be compared, so in practice the unordered map may well be a lot faster.

The hash implementation may choose, for example, to use only a fixed number of spread-out digits to compute the hash value and thus end up putting some near-identical strings in the same bucket, so it's a trade-off. You can probably concoct a set of key values for which both containers would perform very poorly, but for a "random" or "typical" collection of strings, my bet is on the hash container.

Upvotes: 10

Related Questions