Karl

Reputation: 5733

Data structure for storing huge number of indices, each pointing to a set

I am using a red-black tree implementation in C++ (std::map), but I see that my unsigned long long int indices get bigger and bigger as the experiments grow. I am going for 700,000,000 indices, and each index stores a std::set that contains a few more int elements (about 1-10). We have 128 GB of RAM, but I see that we are starting to run short of it; in fact, I would like to go even up to 1,000,000,000 indices in my experiment, if possible.

I gave this some thought and was thinking about a forest of several maps put together. Basically, after a map hits a certain size threshold (or perhaps when bad_alloc starts to be thrown), I would save it to disk, clear it out of memory, and then create another map, repeating until all indices are stored. However, the loading part would be very inefficient, as we can only hold one map in RAM at a time. Worse, we would need to check all maps for consistency.

So in this case, what data structures should I be looking at?

Upvotes: 1

Views: 1686

Answers (3)

rici

Reputation: 241971

From your description, I think you have this:

typedef std::map<long long, std::set<int>> MyMap;

where the map is very big, and the individual sets are quite small. There are several sources of overhead here:

  • the individual entries in the map, each of which is a separate allocation;
  • the individual entries in the sets, ditto;
  • the structures which describe each set, independent of their contents.

With standard library components, it's not possible to eliminate all of these overheads; the semantics of associative containers pretty well mandates the individual allocation of each entry, and the use of red-black trees requires the addition of several pointers to each entry (in theory, only two pointers are required, but efficient implementation of iterators is difficult without parent pointers.)
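For a rough sense of scale (a back-of-envelope estimate, assuming a typical 64-bit implementation): each tree node carries three pointers plus a color flag, and each heap allocation adds its own bookkeeping, so a set node holding a single int can easily occupy 50 or more bytes, and a map node whose mapped value is a std::set object still more. With 700,000,000 map entries and a handful of set entries each, the total lands in the hundreds of gigabytes, which matches the observed exhaustion of 128 GB.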

However, you can reduce the overhead without losing functionality by combining the map with the sets, using a datastructure like this:

typedef std::set<std::pair<long long, int>> MyMap;

You can still answer all the same queries, although a few of them are slightly less convenient. Remember that std::pair's default comparator sorts in lexicographical order, so all of the elements with the same first value will be contiguous. So you can, for example, query whether a given index has any ints associated with it by using:

auto it = theMap.lower_bound(std::make_pair(index, INT_MIN));
if (it != theMap.end() && it->first == index) {
  // there is at least one int associated with index
}

The same call to lower_bound will give you a begin iterator for the ints associated with the key, while a call to upper_bound(std::make_pair(key, INT_MAX)) will give you the corresponding end iterator, so you can easily iterate over all the values associated with a given key.
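Putting the two calls together (a sketch under the same assumptions as the snippets above; printValues is a hypothetical helper name):

#include <climits>
#include <cstdio>
#include <set>
#include <utility>

typedef std::set<std::pair<long long, int>> MyMap;

// Print every int stored for a given index. Pairs sort lexicographically,
// so all entries sharing an index are contiguous in the set.
void printValues(const MyMap& theMap, long long index) {
    auto begin = theMap.lower_bound(std::make_pair(index, INT_MIN));
    auto end = theMap.upper_bound(std::make_pair(index, INT_MAX));
    for (auto it = begin; it != end; ++it)
        std::printf("%d\n", it->second);
}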

That still might not be enough to store 700 million indices with associated sets of integers in 128 GB unless the average set size is really small. The next step would have to be a B-tree of some form, which is not in the standard library. B-trees avoid the individual entry overhead by combining a number of entries into a single cluster; that should be sufficient for your needs.
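If the workload splits into a build phase followed by a query phase, a sorted std::vector of the same pairs gives B-tree-like density using only the standard library (my addition, not part of the original answer): entries are stored contiguously with no per-node pointers, and equal_range answers per-index queries in logarithmic time.

#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

typedef std::pair<long long, int> Entry;

int main() {
    std::vector<Entry> entries;
    entries.push_back(Entry(42LL, 7));   // build phase: append all pairs
    entries.push_back(Entry(42LL, 3));

    std::sort(entries.begin(), entries.end());  // sort once, then query

    // All values stored for index 42 lie in [range.first, range.second);
    // the comparator looks only at the index half of each pair.
    auto range = std::equal_range(
        entries.begin(), entries.end(), Entry(42LL, 0),
        [](const Entry& a, const Entry& b) { return a.first < b.first; });
    for (auto it = range.first; it != range.second; ++it)
        std::printf("%d\n", it->second);
}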

Upvotes: 2

zaufi

Reputation: 7129

It looks like it is time to switch to B-trees (maybe B+ or B*); this is the structure used in databases to manage indices. Take a look here: this is a replacement for std-like associative containers with a B-tree inside... but B-trees can also be used to keep indices both in memory and on disk...
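For example (a sketch assuming Google's cpp-btree library, whose containers expose an std::map/std::set-compatible interface; other B-tree libraries will differ in header and namespace):

#include <utility>
#include "btree_set.h"  // from cpp-btree (assumed to be on the include path)

// Same pair layout as in rici's answer, but stored in B-tree nodes that
// pack many entries per allocation instead of one node per entry.
typedef btree::btree_set<std::pair<long long, int>> MyMap;

int main() {
    MyMap theMap;
    theMap.insert(std::make_pair(123LL, 7));
    // lower_bound / upper_bound work exactly as with std::set
}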

Upvotes: 2

eladidan

Reputation: 2644

For such a large-scale dataset, you should really work with a proper database server, such as an SQL server. These servers are designed to work with large cached datasets. An SQL server saves the data to permanent storage such as an HDD, while maintaining good read/write performance by caching frequently accessed pages, etc.
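As a lighter-weight variation on this suggestion, the same schema can be prototyped with embedded SQLite rather than a full server (a minimal sketch; the table and column names are hypothetical):

#include <cstdio>
#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("indices.db", &db) != SQLITE_OK) return 1;

    // One row per (index, element) pair; the index on idx turns
    // "all elements for a given index" into a range scan.
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS pairs (idx INTEGER, val INTEGER);"
        "CREATE INDEX IF NOT EXISTS pairs_by_idx ON pairs(idx);",
        nullptr, nullptr, nullptr);

    // Insert one pair via a prepared statement.
    sqlite3_stmt* ins = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO pairs VALUES (?1, ?2);", -1, &ins, nullptr);
    sqlite3_bind_int64(ins, 1, 123456789LL);
    sqlite3_bind_int(ins, 2, 42);
    sqlite3_step(ins);
    sqlite3_finalize(ins);

    // Query all elements associated with one index.
    sqlite3_stmt* qry = nullptr;
    sqlite3_prepare_v2(db, "SELECT val FROM pairs WHERE idx = ?1;", -1, &qry, nullptr);
    sqlite3_bind_int64(qry, 1, 123456789LL);
    while (sqlite3_step(qry) == SQLITE_ROW)
        std::printf("%d\n", sqlite3_column_int(qry, 0));
    sqlite3_finalize(qry);

    sqlite3_close(db);
    return 0;
}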

Upvotes: 1
