Reputation: 1396
I have a C function that produces about 6 million unique arrays. These arrays always have 17 elements each, and each element is an integer from 0 to 16. I also have a slightly modified version of that function that will also produce about 6 million unique arrays of the same kind. My problem is that the second one produces about 45,000 fewer results than the first, and I'd like to see what these results are.
So my approach is to simply store all the results of the second function (the calculator tells me this should not take more than 400 MB, which is fine to keep in memory) and then look up the results of the first, printing out the ones that don't exist.
Assuming the general approach makes sense (and if not, do tell), what I am looking for is an appropriate data structure (ideally with a good implementation in C) that can hold about 6 million unique permutations of
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
(or some transformation thereof) and then perform fast membership testing on them. As the title says, I do have some suspicions about which data structures may do the job, but I am not certain tries or hashmaps are the best choice for this.
This is an algorithm to detect a flaw in another algorithm, not something that will be used in production. I am interested in doing this in a way that can be coded and return results relatively quickly in human terms, not necessarily in shaving milliseconds, so the existence of easy-to-grok libraries that will do most of the job is definitely a plus.
Upvotes: 3
Views: 3183
Reputation: 1676
a) create a struct that contains two 64-bit ints
b) since each result has 17 elements, pack the first 8 into the first int and the other 9 into the second int (for example, by treating the values as base-17 digits)
c) create an operator < for your struct
d) create a set of your struct and insert all of your results from your first run
e) iterate through your second run results and do a set::find()
#include <cstdint>
#include <set>

class Result
{
public:
    Result(const int arr[17]) // Fill in _n1 and _n2 (first 8 / last 9 values as base-17 digits)
    {
        _n1 = 0;
        for (int i = 0; i < 8; ++i)
            _n1 = _n1 * 17 + arr[i];
        _n2 = 0;
        for (int i = 8; i < 17; ++i)
            _n2 = _n2 * 17 + arr[i];
    }
    bool operator < (const Result& r) const // Compare
    {
        if (_n1 != r._n1)
            return _n1 < r._n1;
        return _n2 < r._n2;
    }
protected:
    int64_t _n1;
    int64_t _n2;
};
typedef std::set< Result > SetResult;
SetResult setResult;
Edwin
Upvotes: 0
Reputation: 48687
I think you need to weigh up whether doing this entirely in C just to avoid communication between programs is really worth it.
I would print each array from C line by line as space-separated integers, then load that file to create a set of byte arrays like this (F# code):
let load file =
    System.IO.File.ReadAllLines file
    |> Array.Parallel.map (fun s -> s.Split([|' '|]) |> Array.map (int >> byte))
    |> set
and then compute the set difference between two files like this:
load "file1.txt" - load "file2.txt"
That would probably take a few minutes to run.
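On the C side, the per-line printing could be as simple as the following sketch (how the generating function exposes each array is an assumption; here it just calls a print helper for every result, with stdout redirected to a file):
#include <stdio.h>

/* Hypothetical hook: call this once per generated array and redirect
   stdout to file1.txt / file2.txt for the two runs. */
void print_result(const int arr[17])
{
    for (int i = 0; i < 17; i++)
        printf(i ? " %d" : "%d", arr[i]);
    putchar('\n');
}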
Upvotes: 3
Reputation: 11
@Steve Jessop You can do the last step in linear time, with a smarter search that discards values of the array being searched once they can no longer match:
Let's say n is the size of A and m is the size of B, with both arrays sorted:
int counter_A = 0;
int counter_B = 0;
int counter_C = 0;
while (counter_A != n) {
    int temp = A[counter_A];
    counter_A++;
    /* Skip everything at the front of B that is smaller than the current
       element of A; since both arrays are sorted, those values can never
       match any remaining element of A. */
    for (; counter_B < m && B[counter_B] < temp; counter_B++);
    if ((counter_B < m && B[counter_B] > temp) || counter_B == m) {
        C[counter_C] = temp;
        counter_C++;
    }
}
This runs in O(n + m) time, since every step of the algorithm increments at least one of the counters.
Upvotes: 1
Reputation: 279285
Keeping it simple:
- store all the results of the second run in one big array;
- qsort that array with a comparison function that just calls memcmp(left, right, 17);
- for each result of the first run, bsearch the sorted array (with the same comparison function) and print it if it isn't found.
Each of the last two steps will perform something of the order of 6M * log(6M) comparisons, which is about 138M. Which is probably still less time than it takes to write the code. Which isn't long, since everything is so simple :-)
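A minimal sketch of that, assuming the two runs have already dumped their results into in-memory arrays of unsigned char[17] (the function and variable names are just placeholders):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N_ELEMS 17

/* Comparison function for qsort/bsearch: each record is 17 bytes. */
static int cmp_result(const void *left, const void *right)
{
    return memcmp(left, right, N_ELEMS);
}

/* Print every record of `first` (n1 records) that does not appear in
   `second` (n2 records). `second` is sorted in place. */
static void print_missing(unsigned char (*first)[N_ELEMS], size_t n1,
                          unsigned char (*second)[N_ELEMS], size_t n2)
{
    qsort(second, n2, N_ELEMS, cmp_result);

    for (size_t i = 0; i < n1; i++) {
        if (!bsearch(first[i], second, n2, N_ELEMS, cmp_result)) {
            for (int k = 0; k < N_ELEMS; k++)
                printf("%d ", first[i][k]);
            putchar('\n');
        }
    }
}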
Upvotes: 2
Reputation: 41404
Optimality would kind of depend on how the permutations are distributed and the ratio of insertions to searches. Since you are not concerned with optimality, but just want a straightforward way to test a hypothesis without waiting all night for results, my gut says:
An integer in [0,16] can be represented as a five-bit number, so seventeen of them can be packed into an 85-bit (11-byte) binary string. So you can just use one of the many libraries available for storing sorted/hashed sets of strings with membership tests on them, and be done. It won't be quite as fast or cache-coherent as a tuned trie, but it'll be good enough to grind through 66 MB of data in a few seconds, and you'll be done by lunch.
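A rough sketch of that packing in C (the helper name and the 11-byte key layout are just one way to do it):
#include <stdint.h>

/* Pack 17 values in [0,16] (5 bits each) into an 11-byte (88-bit) key;
   the top 3 bits simply stay zero. */
void pack_key(const int arr[17], uint8_t key[11])
{
    for (int i = 0; i < 11; i++)
        key[i] = 0;

    for (int i = 0; i < 17; i++) {
        int bit  = i * 5;                   /* starting bit position */
        int byte = bit / 8;
        int off  = bit % 8;
        key[byte] |= (uint8_t)(arr[i] << off);
        if (off > 3)                        /* value straddles a byte boundary */
            key[byte + 1] |= (uint8_t)(arr[i] >> (8 - off));
    }
}
Keys like this can then be handed to whatever string-set library is at hand, with the key length fixed at 11 bytes.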
If no such library is conveniently to hand and you have to work from scratch, I'd just make a sorted list of the strings and then do the membership tests via binary search. That works out to something like O(n log n) for the sort plus O(m log n) for the lookups, i.e. O((n + m) log n) overall. If this is only being run as an offline job once or twice during production, that might be good enough; if you're going to do this more than once a day, I'd worry about cache locality and use a trie or B-tree.
Upvotes: 5
Reputation: 61920
It depends on which one would get better memory performance in your case, and also on which hash function you use, how you resolve collisions, and so on. How about checking out a hash array mapped trie (HAMT)?
Upvotes: 1