user2033412
user2033412

Reputation: 2109

more efficient structure as unordered_map<pair<int, int>, int>

I have about 20,000,000 pair<int, int> which I need to associate to ints. I did so with an unordered_map<pair<int, int>, int>. Profiling my algorithm shows that checking whether an entry exists or not

bool exists = myMap[make_pair(a, b)] != NULL

is the performance bottleneck. I thought that retrieving this information from an unordered_map would be really fast, as it is O(1). But constant time can be slow if the constant is big...

My hash-function is

template <>
struct tr1::hash<pair<int, int> > {
public:
        size_t operator()(pair<int, int> x) const throw() {
             size_t h = x.first * 1 + x.second * 100000;
             return h;
        }
};

Do you know any better data-structure for my problem?

Obviously I can't just store the information in a matrix, hence the amount of memory wouldn't fit into any computer in existence. All I know about the distribution is that myMap[make_pair(a, a)] doesn't exist for any a. And that all ints are in a continuous range from 0 to about 20,000,000.

Think of it as a sparse 20,000,000x20,000,000-Matrix with about 20,000,000 entries but never on the main diagonal.

Suggestion

Would a vector<pair<int, int>>* (array with N entries) expected to be faster? The lookup for a would be trivial (just the index of the array) and then I would iterate through the vector, comparing the first value of the pair to b.

BIG UPDATE

I uploaded the raw data so you can see the structure.

Upvotes: 0

Views: 3406

Answers (5)

user2033412
user2033412

Reputation: 2109

As suggestet, I went with a vector<pair<int, int>>* with N entries. It's about 40% faster than the unordered_map.

Upvotes: 1

Tony Delroy
Tony Delroy

Reputation: 106068

Main thing is definitely to avoid adding default-constructed elements with every search:

bool exists = myMap[make_pair(a, b)] != NULL; // OUCH

bool exists = myMap.find(make_pair(a, b)) != myMap.end();  // BETTER

iterator i = myMap.find(make_pair(a, b);
if (i != myMap.end()) ... else ...;      // MAY BE BEST - SEE BELOW

And the great hash challenge... woo hoo! This might be worth a shot, but a lot depends on how the numbers in the pairs are distributed and your implementation's std::hash (which is often pass-through!):

    size_t operator()(pair<int, int> x) const throw() {
         size_t hf = std::hash(x.first);
         return (hf << 2) ^ (hf >> 2) ^ std::hash(x.second);
    }

You may also find it faster if you replace the pair with int64_ts, so that the key comparisons are definitely simple integer comparisons rather than cascaded.

Also, what are you doing after the test for existence? If you need to access/change the value associated with the same key then you should save the iterator find returns and avoid another search.

Upvotes: 2

First off, myMap[make_pair(a, b)] != NULL does not do what you think it does. It inserts the pair if it doesn't exist, and compares the mapped value to 0 (which is what NULL expands to). It does not check for existence at all. (Note that in modern C++, you should never use NULL. Use 0 for numbers and nullptr for pointers).

As for the main topic, your hash function doesn't seem too good. Don't forget that arithmetic on ints is done in ints. Since on most compilers int is 32-bit, its maximum value is a little over 2,000,000,000. So 20,000,000 * 10,000 is way bigger than that, leading to overflow (and undefined behaviour).

Given the number of your data, I assume you're on a 64-bit platform, which means size_t is 64 bits long. So you might get better results with a hash function like this:

size_t operator()(pair<int, int> x) const throw() {
     size_t f = x.first, s = x.second;
     return f << (CHAR_BIT * sizeof(size_t) / 2) | s;
}

This should produce significantly less collisions (and have defined behaviour) that what you have now.

If this doesn't help, you could also try a two-step approach:

std::unordered_map<int, std::unordered_map<int, int>>

Lookup by x.first first, then by x.second. I don't know if this would help; measure and see.

Upvotes: 3

Blastfurnace
Blastfurnace

Reputation: 18652

I suggest you test with a better hash function. You can find examples if you search here on SO but this is one possible implementation.

struct pair_hash {
    template <typename T1, typename T2>
    size_t operator()(const std::pair<T1, T2> &pr) const {
        using std::hash;
        return hash<T1>()(pr.first) ^ hash<T2>()(pr.second);
    }
};

Upvotes: 0

Baptiste Wicht
Baptiste Wicht

Reputation: 7663

Have you tried using myMap.find(make_pair(a,b)) != myMap.end() ? operator[] creates the element if it does not exist. I would expect find to be faster.

Upvotes: 5

Related Questions