sinclair
sinclair

Reputation: 1

An integer hashing problem

I have a (C++) std::map<int, MyObject*> that contains a couple of millions of objects of type MyObject*. The maximum number of objects that I can have, is around 100 millions. The key is the object's id. During a certain process, these objects must be somehow marked( with a 0 or 1) as fast as possible. The marking cannot happen on the objects themselves (so I cannot introduce a member variable and use that for the marking process). Since I know the minimum and maximum id (1 to 100_000_000), the first thought that occured to me, was to use a std::bit_set<100000000> and perform my marking there. This solves my problem and also makes it easier when marking processes run in parallel, since these use their own bit_set to mark things, but I was wondering what the solution could be, if I had to use something else instead of a 0-1 marking, e.g what could I use if I had to mark all objects with an integer number ?

Is there some form of a data structure that can deal with this kind of problem in a compact (memory-wise) manner, and also be fast ? The main queries of interest are whether an object is marked, and with what was marked with.

Thank you.

Note: std::map<int, MyObject*> cannot be changed. Whatever data structure I use, must not deal with the map itself.

Upvotes: 0

Views: 167

Answers (4)

Andy Finkenstadt
Andy Finkenstadt

Reputation: 3587

The most important question to ask yourself is "How many of these 100,000,000 objects might be marked (or remain unmarked)?" If the answer is smaller than roughly 100,000,000/(2*sizeof(int)), then just use another std::set or std::tr1::unordered_set (hash_set previous to tr1) to track which ones are so marked (or remained unmarked).

Where does 2*sizeof(int) come from? It's an estimate of the amount of memory overhead to maintain a heap structure in a deque of the list of items that will be marked.

If it is larger, then use std::bitset as you were about to use. It's overhead is effectively 0% for the scale of quantity you need. You'll need about 13 megabytes of contiguous ram to hold the bitset.

If you need to store a marking as well as presence, then use std::tr1::unordered_map using the key of Object* and value of marker_type. And again, if the percentage of marked nodes is higher than the aforementioned comparison, then you'll want to use some sort of bitset to hold the number of bits needed, with suitable adjustments in size, at 12.5 megabytes per bit.

A purpose-built object holding the bitset might be your best choice, given the clarification of the requirements.


Edit: this assumes that you've done proper time-complexity computations for what are acceptable solutions to you, since changing the base std::map structure is no longer permitted.

Upvotes: 1

Pablo
Pablo

Reputation: 8644

If you don't mind using hacks, take a look at the memory optimization used in Boost.MultiIndex. It can store one bit in the LSB of a stored pointer.

Upvotes: 0

Oliver Charlesworth
Oliver Charlesworth

Reputation: 272557

If you're not concerned with memory, then a std::vector<int> (or whatever suits your need in place of an int) should work.

If you don't like that, and you can't modify your map, then why not create a parallel map for the markers?

std::map<id,T> my_object_map;
std::map<id,int> my_marker_map;

If you cannot modify the objects directly, have you considered wrapping the objects before you place them in the map? e.g.:

struct
{
    int marker;
    T *p_x;
} T_wrapper;


std::map<int,T_wrapper> my_map;

If you're going to need to do lookups anyway, then this will be no slower.

EDIT: As @tenfour suggests in his/her answer, a std::pair may be a cleaner solution here, as it saves the struct definition. Personally, I'm not a big fan of std::pairs, because you have to refer to everything as first and second, rather than by meaningful names. But that's just me...

Upvotes: 3

tenfour
tenfour

Reputation: 36896

How about making the value_type of your map a std::pair<bool, MyObject*> instead of MyObject*?

Upvotes: 4

Related Questions