c00000fd

Reputation: 22255

Using std::map to find duplicates in data has a performance issue. Can I pre-allocate it?

My goal is to check for duplicate IDs in my custom data structure. I'm coding it with Visual Studio 2008 Sp1. So I wrote the following debugger-only function that does "sanity checks" on the data as I work with the code.

Here's pseudo-code for the approach I'm using:

std::map<ULONGLONG, const MY_ELEMENT*> dicTagIDs;

for(std::vector<MY_ELEMENT>::const_iterator iTag = arrNode.begin(); iTag != arrNode.end(); ++iTag)
{
    std::map<ULONGLONG, const MY_ELEMENT*>::const_iterator itrT = dicTagIDs.find(iTag->uiTagID);
    if(itrT != dicTagIDs.end())
    {
        //Duplicate tag ID!
        const MY_ELEMENT* pDuplicateToTag = itrT->second;
        ASSERT(FALSE);
    }

    //If not, then add it
    dicTagIDs[iTag->uiTagID] = &*iTag;

    //... may later call itself recursively
}

This works, except that the function above gets dramatically slower as the size of my data structure grows (each call does n lookups of O(log n) each, and the whole thing runs on every check). And since I use it everywhere throughout my "debugger-build" code, it'd be nice to speed it up.

So I was thinking to pre-allocate std::map. I don't know the exact number of elements in my data structure (it is structured kinda like an XML file) but I can make an approximate guess that I could use for preallocation.

But std::map doesn't seem to provide a reserve() method.

So any suggestions how to speed this up?

Upvotes: 1

Views: 1460

Answers (3)

Rakurai

Reputation: 1016

std::map uses a balanced tree so keys can be traversed in sorted order, which means insertion and lookup both take O(lg n) time; insertion can additionally trigger a rebalance of the tree. An unordered map uses a hash table, which allows for average constant-time insertion and lookup, at the cost of sorted traversal of the keys. There are published performance comparisons of the two that cover use cases like yours.

As some have pointed out, you may not have access to unordered maps in VS2008. You could get some of the benefit by using a hash table of your own: just use an array of H maps, and lookup/insert to the array using modulo H. The average depth of the trees underlying the maps will decrease by lg(H), and rebalances should be less costly as well. Example:

std::map<int, int> dicts[HASHKEY]; // array of H maps
dicts[id % HASHKEY][id] = whatever; // insertion
auto it = dicts[id % HASHKEY].find(id); // lookup

Here is some example code that demonstrates the speedup:

#include <algorithm>  // std::generate, std::shuffle
#include <cassert>
#include <cstdio>     // printf
#include <ctime>
#include <functional> // std::bind
#include <iostream>
#include <map>
#include <random>
#include <vector>

#define MAX_HASH 16384

int main() {
    unsigned int nIDs = 1 << 21;
    std::vector<int> IDs(nIDs);

    // random fill
    std::random_device rnd;
    std::mt19937 mersenne_engine(rnd());
    std::uniform_int_distribution<int> dist(0, nIDs); // get some collisions
    auto gen = std::bind(dist, mersenne_engine);
    std::generate(begin(IDs), end(IDs), gen);

    std::map<int, int> dicts[MAX_HASH];

    // time insertion into 1..MAX_HASH maps
    for (unsigned int hash = 1; hash <= MAX_HASH; hash <<= 1) {
        for (unsigned int i = 0; i < hash; ++i)
            dicts[i].clear();

        // time simple insertion/overwrite
        clock_t start = clock();

        for (unsigned int i = 0; i < nIDs; ++i) {
            int id = IDs[i];
            dicts[id % hash][id] = 1;
        }

        double insert_elapsed = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

        // time lookup
        // shuffle the vector to avoid optimal lookups
        std::shuffle(begin(IDs), end(IDs), mersenne_engine);        
        start = clock();

        for (unsigned int i = 0; i < nIDs; ++i) {
            int id = IDs[i];
            auto it = dicts[id % hash].find(id);
            assert(it->second == 1);
        }

        double lookup_elapsed = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

        // time lookup+insertion
        for (unsigned int i = 0; i < hash; ++i)
            dicts[i].clear();

        start = clock();

        for (unsigned int i = 0; i < nIDs; ++i) {
            int id = IDs[i];
            unsigned int dnum = id % hash;
            auto it = dicts[dnum].find(id);

            if (it == dicts[dnum].end())
                dicts[dnum][id] = 1;
        }

        double cond_elapsed = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

        printf("%5d maps, avg. %7ld keys - insert: %1.3f, lookup: %1.3f, conditional insert: %1.3f\n",
            (int)hash, (long)(IDs.size() / hash), insert_elapsed, lookup_elapsed, cond_elapsed);
    }
}

Output:

    1 maps, avg. 2097152 keys - insert: 1.958, lookup: 2.010, conditional insert: 2.473
    2 maps, avg. 1048576 keys - insert: 1.956, lookup: 1.988, conditional insert: 2.390
    4 maps, avg.  524288 keys - insert: 1.911, lookup: 1.982, conditional insert: 2.416
    8 maps, avg.  262144 keys - insert: 1.848, lookup: 1.956, conditional insert: 2.407
   16 maps, avg.  131072 keys - insert: 1.846, lookup: 1.973, conditional insert: 2.312
   32 maps, avg.   65536 keys - insert: 1.814, lookup: 1.882, conditional insert: 2.303
   64 maps, avg.   32768 keys - insert: 1.796, lookup: 1.973, conditional insert: 2.258
  128 maps, avg.   16384 keys - insert: 1.807, lookup: 1.879, conditional insert: 2.250
  256 maps, avg.    8192 keys - insert: 1.785, lookup: 1.849, conditional insert: 2.158
  512 maps, avg.    4096 keys - insert: 1.731, lookup: 1.820, conditional insert: 2.229
 1024 maps, avg.    2048 keys - insert: 1.743, lookup: 1.787, conditional insert: 2.050
 2048 maps, avg.    1024 keys - insert: 1.614, lookup: 1.709, conditional insert: 1.993
 4096 maps, avg.     512 keys - insert: 1.628, lookup: 1.638, conditional insert: 1.963
 8192 maps, avg.     256 keys - insert: 1.549, lookup: 1.568, conditional insert: 1.803
16384 maps, avg.     128 keys - insert: 1.468, lookup: 1.474, conditional insert: 1.688

As you can see, the performance gets better as the number of maps increases, at the cost of more memory allocated to the map objects.

Upvotes: 1

user2100815

Reputation:

For each element in my data structure I see if its ID is in my std::map, and if so, the function asserts. Otherwise it adds it to the map.

You probably don't want to do that; simply use the map's insert() member function, which will fail if the key is already in the map.

And you can't reserve space in tree-based containers like std::map.

Upvotes: 3

Curious

Reputation: 21510

std::unordered_maps are faster for key lookups (much faster in my experience) in most implementations: algorithmically they are O(1) on average, as opposed to O(log(n)) for std::map.

And they have a reserve() method.

Upvotes: 3
