Reputation: 1373
Say I have a universal set of indexed objects, U, and a subset of these objects, S. S is large (say, 1,000,000 elements); however, U is MUCH larger (say, 100,000,000 at least).
I would like to perform two basic operations on these sets:
(1) Given any integer x from 0 to the size of U minus 1, check for membership in S; if x is not a member, then add it to S, and
(2) Select (and remove) a random element from S.
In order to perform the first part of operation (1), it makes sense to me to keep a boolean vector v the size of U, where the value is true if element x is a member of set S.
However, because U is so much larger than S, choosing a random element in v and hoping that it is also an element in S does not make sense. If U is 100 times larger than S, then it would only find an element of S, on average, one time in every 100 tries.
So, in order to perform the second operation, it makes sense to maintain a list of indices of elements that are in S and select a random element from that.
The only problem is that there are now two copies of the same data, and both need to be updated separately with each operation. Here is the pseudocode for the first operation:
** operation 1 - check membership and add **
input: boolean vector, v
integer vector, S
integer, x
if v[x] is not true:
    v[x] = true
    append x to S
return
That is relatively simple, but it has to update the vector of indices even though it didn't use it for the lookup. Here is the second operation:
** operation 2 - select and remove random element of S **
input: boolean vector, v
integer vector, S
generate random integer x between 0 and size of S
set v[S[x]] to false
remove S[x] from S
return
Maintaining two copies of the data has made both of those operations more complicated, because each has to update both data structures, even if it only needs one. Is this bad practice?
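For reference, here is roughly what this two-structure version could look like in C++ (the names are made up for illustration, and I'm assuming the removal in operation 2 is done by swapping the chosen index with the last one so the erase stays O(1)):

#include <cstddef>
#include <random>
#include <vector>

// Rough sketch of the two-structure approach described above.
struct SubsetTracker {
    std::vector<bool> inS;      // v: membership flags, one per element of U
    std::vector<int>  members;  // S: indices of elements currently in the subset
    std::mt19937      gen{std::random_device{}()};

    explicit SubsetTracker(std::size_t universeSize) : inS(universeSize, false) {}

    // Operation 1: check membership and add.
    void addIfAbsent(int x) {
        if (!inS[x]) {
            inS[x] = true;
            members.push_back(x);
        }
    }

    // Operation 2: select and remove a random element of S (assumes S is non-empty).
    int removeRandom() {
        std::uniform_int_distribution<std::size_t> dis(0, members.size() - 1);
        std::size_t i = dis(gen);
        int x = members[i];
        inS[x] = false;
        members[i] = members.back();  // overwrite with the last index...
        members.pop_back();           // ...so no shifting is needed
        return x;
    }
};

The swap-with-last trick is just one way to avoid shifting the index vector; the pseudocode above leaves the removal strategy open.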
The only alternative I could come up with is to use only one or the other, but that makes one operation simpler and the other more complicated. For example (only the more complicated versions are given):
** operation 1 - check membership and add **
input: integer vector, S
integer, x
iterate over S:
    if x is found:
        return
append x to S
return
So every time it will have to iterate over the entire S instead of doing a single look-up, and
** operation 2 - select and remove random element of S **
input: boolean vector, v
while true:
    generate random integer x between 0 and size of v
    if v[x] is true:
        v[x] = false
        return
Both of these seem pretty inefficient, especially if the sizes of U and S are large and the difference between U and S is also large. Is there a way that I can perform both of these operations efficiently with only one data structure? Or is there not really a big problem with maintaining two copies of the same thing?
EDIT:
The code I am writing is in C++, so I guess I am asking about C++ data structures in particular, but the question isn't really language specific.
Upvotes: 4
Views: 325
Reputation:
If you don't mind rolling up your sleeves, I recommend a "sparse hierarchical bitset" (best name I could come up with for this -- I suck at naming things). Something to this effect:
I use this structure as my general replacement for std::set<int> since it's so much faster to iterate, insert into, remove from, and perform set operations on (intersection, union, difference), at least for my kind of use cases (which seem similar to yours, with similar input sizes in the hundreds of millions). And naturally it takes way, way less memory with those kinds of input sizes than a balanced binary tree (like the difference between 15 megabytes and 2 gigabytes) or a hash table storing integers. After all, a single 64-bit integer can be used to indicate up to 64 indices' worth of data instead of one.
It can do set intersections between a couple hundred million densely packed elements in 3-4 iterations (log(N)/log(64) iterations in the best case, N*log(N)/log(64) in the worst case), and in less than a nanosecond. Meanwhile, because it's sparse in the sense that it just stores a null pointer and avoids allocating memory for completely unused ranges of indices, it doesn't take an explosive amount of memory (around 1-2 bits on average for each index inside the set). Hopefully the diagram gives a decent enough clue on how to implement it. The set intersection isn't actually stored (otherwise it would be considerably more expensive). Instead the data structure invokes a callback like so:
// Indicates that [first, last) are in the resulting set.
typedef void SetResults(int first, int last, void* user_data);
... couldn't use functors since it's implemented in C. Of course this structure has a glaring weakness, and that's when you don't have many contiguous ranges of indices in the set -- the worst case being something like every index being an odd number, or every index being an even number, at which point the root node would store a bunch of 0s and every single search would require drilling down to the leaves and checking individual bits for individual indices, hitting that worst-case N*log(N)/log(64) scenario for things like set intersections. For my use cases, however, the indices tend to be mostly contiguous, and it still outperforms std::set<int> even in the worst-case scenario when doing things like set intersections and checking to see which elements are in the set. For worst-case scenarios, it's still often an order of magnitude faster than std::set. For best-case ones, it tends to be millions of times faster on these types of input sizes. For average scenarios it tends to be thousands of times faster, like the difference between a millisecond and a second.
If that takes too much time to implement, then my second suggestion is just a sparse bitset without bothering to make it hierarchical. Store blocks of bitsets (an array of pointers to arrays, e.g.). When all the bits in a block are set to 0, free that block and set its pointer to null, so that an entirely unused range of indices costs no more memory than a pointer. That's trivial to implement and still pretty fast because it can compare many bits at once using bitwise instructions.
Now when you do a set union, as you do for step 1, all you do is iterate through the blocks that are occupied in both sets in parallel and do a bitwise or. You can do a set union on 64 elements (64 bits) at a time this way, or even more if you use SIMD (ex: 512 elements at a time). This is a linear-time set union, but it's worth noting that the empty blocks let you skip over, say, 1024 unused bits at a time, and when you encounter a block occupied in both sets, you can use bitwise operations to perform unions 64+ bits at a time.
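If it helps, a minimal sketch of that simpler, non-hierarchical variant could look something like the following (the block size, class and member names, and the in-place union are my own illustrative choices, not a fixed design): blocks are allocated lazily, freed again when they go all-zero, and the union ORs corresponding blocks one 64-bit word at a time.

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal sketch of a sparse (non-hierarchical) bitset: an array of pointers
// to fixed-size blocks of bits. An unused range costs only a null pointer.
class SparseBitset {
public:
    static constexpr std::size_t kWordsPerBlock = 16;                 // 16 * 64 = 1024 bits per block
    static constexpr std::size_t kBitsPerBlock  = kWordsPerBlock * 64;
    using Block = std::array<std::uint64_t, kWordsPerBlock>;

    explicit SparseBitset(std::size_t universeSize)
        : blocks_((universeSize + kBitsPerBlock - 1) / kBitsPerBlock, nullptr) {}

    SparseBitset(const SparseBitset&) = delete;
    SparseBitset& operator=(const SparseBitset&) = delete;
    ~SparseBitset() { for (Block* b : blocks_) delete b; }

    void set(std::size_t i) {
        Block*& b = blocks_[i / kBitsPerBlock];
        if (!b) b = new Block{};                                      // allocate lazily, zero-filled
        (*b)[(i % kBitsPerBlock) / 64] |= std::uint64_t{1} << (i % 64);
    }

    bool test(std::size_t i) const {
        const Block* b = blocks_[i / kBitsPerBlock];
        if (!b) return false;                                         // vacant range: bit is 0
        return ((*b)[(i % kBitsPerBlock) / 64] >> (i % 64)) & 1u;
    }

    void clear(std::size_t i) {
        Block*& b = blocks_[i / kBitsPerBlock];
        if (!b) return;
        (*b)[(i % kBitsPerBlock) / 64] &= ~(std::uint64_t{1} << (i % 64));
        for (std::uint64_t w : *b)
            if (w) return;                                            // block still has set bits
        delete b;                                                     // all-zero: free the block and
        b = nullptr;                                                  // go back to a null pointer
    }

    // The "operation 1 as set union" case: OR corresponding blocks 64 bits at
    // a time, skipping ranges that are vacant in the other set.
    void unionWith(const SparseBitset& other) {
        std::size_t n = blocks_.size() < other.blocks_.size() ? blocks_.size() : other.blocks_.size();
        for (std::size_t bi = 0; bi < n; ++bi) {
            const Block* src = other.blocks_[bi];
            if (!src) continue;                                       // nothing to add from this range
            Block*& dst = blocks_[bi];
            if (!dst) dst = new Block{};
            for (std::size_t w = 0; w < kWordsPerBlock; ++w)
                (*dst)[w] |= (*src)[w];
        }
    }

private:
    std::vector<Block*> blocks_;                                      // null = this whole range is empty
};

This is just the skeleton; SIMD, smart pointers and so on are left out to keep it short.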
For #2, first compute a set intersection of the indices in both U and S using bitwise and (super fast even with just a sparse bitset: use S to drive the iteration since it's the much smaller set). Then use a random number generator to generate an index for a random block of bits from the resulting intersection set. Then you might use FFS/FFZ starting from a random bit.
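A rough sketch of that last step, with the resulting intersection simplified to a flat vector of 64-bit words and std::countr_zero (C++20) standing in for FFS (all names here are illustrative):

#include <bit>        // std::countr_zero (C++20)
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Pick a random set bit: choose a random word and a random starting bit, mask
// off the bits below the start, and apply find-first-set to what remains,
// scanning forward (with wrap-around) if that word has nothing left.
// Assumes the vector contains at least one set bit.
int pickRandomSetBit(const std::vector<std::uint64_t>& words, std::mt19937& gen) {
    std::uniform_int_distribution<std::size_t> wordDis(0, words.size() - 1);
    std::uniform_int_distribution<int> bitDis(0, 63);

    std::size_t w = wordDis(gen);
    int start = bitDis(gen);

    for (std::size_t tried = 0; tried <= words.size(); ++tried) {
        std::uint64_t masked = words[w] & (~std::uint64_t{0} << start); // ignore bits below `start`
        if (masked != 0)
            return static_cast<int>(w * 64) + std::countr_zero(masked); // FFS on the remaining bits
        w = (w + 1) % words.size();                                     // move on to the next word
        start = 0;                                                      // and consider all of its bits
    }
    return -1;  // unreachable if at least one bit is set
}

Note that this picks set bits with unequal probability (a bit right after a long run of zeros is more likely to be chosen), which may or may not matter depending on how random the selection needs to be.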
Simple diagram to illustrate:
... but of course with more than 3 bits per block (in actuality, maybe 1024+ bits per allocated block). For the set intersection we begin with S and see that block 0 is empty, so we skip over it. Then we look at block 1 and see that it's not null. So now we look at block 1 in U and see that it's not null either. Okay, so now we do a bitwise and to get the resulting bits that are in both S and U. That's your set intersection. It's technically linear-time, but more like O(N/64+) in the worst-case scenario (better when it can skip over many entirely vacant blocks), not O(N), since we can do bitwise operations on many bits at once (64 bits at a time easily on 64-bit machines, more with SIMD).
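Here is what that walk might look like in code, simplified to a per-index callback instead of the [first, last) range callback shown earlier (the block representation and names are mine, for illustration only):

#include <algorithm>   // std::min
#include <bit>         // std::countr_zero (C++20)
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Block-skipping intersection walk. The two sparse bitsets are represented as
// vectors of pointers to equally sized word blocks (null = vacant range), each
// block holding bitsPerBlock bits.
using Block = std::vector<std::uint64_t>;

void forEachInIntersection(const std::vector<const Block*>& s,
                           const std::vector<const Block*>& u,
                           std::size_t bitsPerBlock,
                           const std::function<void(std::size_t)>& callback) {
    const std::size_t blocks = std::min(s.size(), u.size());
    const std::size_t wordsPerBlock = bitsPerBlock / 64;
    for (std::size_t b = 0; b < blocks; ++b) {
        if (!s[b] || !u[b]) continue;                         // vacant in either set: skip this range
        for (std::size_t w = 0; w < wordsPerBlock; ++w) {
            std::uint64_t bits = (*s[b])[w] & (*u[b])[w];     // intersect 64 indices at once
            while (bits != 0) {
                int bit = std::countr_zero(bits);             // lowest set bit in this word
                callback(b * bitsPerBlock + w * 64 + static_cast<std::size_t>(bit));
                bits &= bits - 1;                             // clear that bit and keep going
            }
        }
    }
}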
"So, in order to perform the second operation, it makes sense to maintain a list of indices of elements that are in S and select a random element from that."
This need can be eliminated with the above data structure. As for the general question about best practices: if you need another data structure, then you need another data structure. Some people say parallel arrays are horrible because you have to maintain them in sync with the source array, in spite of the fact that they are often the fastest solution for sequential processing (the SoA approach). But you can always put that behind a class and test it well. It's not so bad as long as you aren't leaking the need to manually keep those parallel arrays in sync into many places in the system.
However, in this case, if you use one of the two data structures suggested above, you don't need to maintain a separate random-access sequence of indices to efficiently find a random element that exists in both sets, since you can compute the set intersection so rapidly, on the fly, as you need it.
Upvotes: 1
Reputation: 7644
After some consideration, I think std::map might be best suited.
#include <random>
#include <iterator>
#include <map>
#include <vector>

struct Data {};                                 // Your actual Object here

constexpr auto universal_size = 100'000;        // I shrank it a little for
constexpr auto subset_size = 1'000;             // the example

std::vector<Data> U(universal_size);            // This is the indexed data store
std::map<int, Data*> S;                         // This is the subset

void add_if_not_in(int idx)                     // idx is universal index.
{                                               // This is one of the
    S[idx] = &U[idx];                           // functionalities you
}                                               // requested.

void remove_by_universal_index(int idx)         // Not strictly needed.
{                                               // Removes object from
    S.erase(idx);                               // subset, by universal
}                                               // index.

void remove_by_subset_index(int idx)            // Removes object from
{                                               // subset, by subset
    auto iter = S.begin();                      // index. Used by
    std::advance(iter, idx);                    // remove_random()
    S.erase(iter);
}

std::mt19937 gen{};                             // A random generator

void remove_random()                            // The second functionality
{                                               // you requested.
    auto sz = S.size();                         // Removes one random element
    std::uniform_int_distribution<>             // from the subset.
        dis(0, sz - 1);
    auto num = dis(gen);
    remove_by_subset_index(num);
}

void add_random()                               // Used to initialize subset.
{                                               // Adds one random element of
    auto sz = U.size();                         // universal set to subset.
    std::uniform_int_distribution<>
        dis(0, sz - 1);
    auto idx = dis(gen);
    add_if_not_in(idx);
}

void setup()                                    // Initialize subset.
{                                               // Just add random until
    while (S.size() < subset_size)              // size is specified.
        add_random();
}

int main()                                      // Try it
{
    setup();
    add_random();
    remove_random();
}
Try the code online here: http://coliru.stacked-crooked.com/a/8236da0ccaf05079
Upvotes: 1
Reputation: 75697
I don't see a (major) problem with any of the 3 approaches. When deciding on one of them you have to take into consideration:
How easy and intuitive it is to understand what the code does. The code should not have any surprising behavior. All three can be about equally readable if you use good naming and cleanly structured code.
How easy it is to debug, test, and extend the code.
The variant with two structures has a slightly higher cost, but just slightly; I don't see much more complexity compared to the other ones. And you could have a test in your unit tests that checks the integrity of the scheme, i.e. checks that the boolean vector and the integer vector agree as to what S is.
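For example, such a check could be as small as this sketch (assuming the std::vector<bool> / std::vector<int> layout from the question; the function name is made up):

#include <cassert>
#include <cstddef>
#include <vector>

// Integrity check for the two-structure variant: every index stored in S must
// be flagged true in v, and the number of true flags in v must equal the size
// of S.
void checkConsistent(const std::vector<bool>& v, const std::vector<int>& S) {
    std::size_t flagged = 0;
    for (bool b : v)
        if (b) ++flagged;
    assert(flagged == S.size());                   // both views agree on how many members there are
    for (int x : S)
        assert(v[static_cast<std::size_t>(x)]);    // every listed index really is flagged
}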
You might make assumptions all day long about which variant is faster and by how much, but at the end of the day any talk about performance is pointless without actual profiling. If performance is an important factor for you, then implement all 3 methods and measure their actual performance.
Upvotes: 2