markt1964

Reputation: 2826

Efficiently picking a unique number

Let's say that I have a potentially largish array of 32-bit numbers, perhaps several million entries long... is there a way to efficiently pick a number of the same bit width that does not appear anywhere in the array?

Naively, I could just pick some random number of the appropriate width, then examine the array, and if it appears there, go back and pick another. But because of the number of elements in the array, the cost of potentially rescanning it repeatedly is worrisome. In practice, I'm not sure how much of a problem this will be: there will never be more than perhaps 20 million entries, while the number of possible values is a few billion, so a rescan may be needed so infrequently that it is a non-issue. Still, the fact that such an algorithm could potentially rescan the array several times is troublesome to me, and I would ideally like a better solution, if one can be found.

Technically, the number does not even have to be random... a deterministic value is acceptable; the only requirement is that the number produced must be unique, i.e., not already appear in the list.
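
In rough C, the naive approach I have in mind would look something like this (random_u32 standing in for whatever 32-bit RNG I end up using):

#include <stddef.h>
#include <stdint.h>

extern uint32_t random_u32(void);   /* hypothetical: any 32-bit RNG */

/* Naive approach: pick a random value, rescan the whole array on any
 * collision. With ~20 million entries out of 2^32 possible values, a
 * retry is only about a 1-in-200 event, but the worst case is unbounded. */
uint32_t pick_unique_naive(const uint32_t *arr, size_t n)
{
    for (;;) {
        uint32_t candidate = random_u32();
        size_t i;
        for (i = 0; i < n; i++)
            if (arr[i] == candidate)
                break;              /* collision: pick again */
        if (i == n)
            return candidate;       /* survived a full scan */
    }
}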

So... is there a runtime-efficient way to generate such a number, or is the random-number approach I described above the only real way to do it? With regard to time/space tradeoffs, I am more interested in speed, so a guaranteed O(n) algorithm would be ideal, but I would not want any additional space requirement greater than about O(n log n).

This will eventually be implemented in C, but a description of the algorithm in any language neutral terminology would be acceptable.

Upvotes: 2

Views: 107

Answers (4)

גלעד ברקן

Reputation: 23955

This solution has O(n) insertion time, and takes at most O(n) time (likely far less) to find the unique number.

Make an additional structure of 2n bits, two per array element, recording whether (element + 1) and (element - 1) exist in the list.

To find a unique number, traverse the bit structure until a zero bit is encountered; the corresponding neighbor value is then known to be absent from the array.

When inserting a new number into the array, update the appropriate bits. For example, when inserting 2, the bit pairs for any 3 and any 1 already in the array would be updated to record that 3 - 1 (in the case of 3) and 1 + 1 (in the case of 1) now exist in the array.

Insertion/deletion time can be reduced by adding a pointer from each element to the next element with the same integer.

(adapted from my answer here Efficiently choose an integer distinct from all elements of a list)
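
A simplified sketch of this idea in C: rather than maintaining the bits incrementally as described above, it derives them from a sorted copy of the array (so this version costs O(n log n); a neighbor's bit is set exactly when that neighbor is adjacent in sorted order). It assumes 0 < n < 2^32, so a free value must exist:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Returns a value absent from arr[0..n-1]. After sorting, a zero
 * "element + 1" bit shows up as a gap between adjacent distinct values;
 * a zero "element - 1" bit on the minimum covers the run starting at 0. */
uint32_t pick_unique(const uint32_t *arr, size_t n)
{
    uint32_t *s = malloc(n * sizeof *s);   /* error handling elided */
    memcpy(s, arr, n * sizeof *s);
    qsort(s, n, sizeof *s, cmp_u32);

    uint32_t result;
    if (s[0] != 0) {
        result = s[0] - 1;        /* minimum's "element - 1" bit is zero */
    } else {
        result = s[n - 1] + 1;    /* fallback: just above the maximum */
        for (size_t i = 0; i + 1 < n; i++) {
            if (s[i + 1] > s[i] + 1) {   /* "element + 1" bit is zero */
                result = s[i] + 1;
                break;
            }
        }
    }
    free(s);
    return result;
}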

Upvotes: 1

Raymond Hettinger

Reputation: 226266

A Bloom filter would meet your needs. It lets you make a concise summary of your multi-million-element array and provides a fast membership test. It allows false positives but no false negatives, which suits your application: a false positive merely rejects a perfectly good candidate, while a duplicate can never slip through, and you don't require perfect randomness.

# Python-style pseudocode: BloomFilter stands in for any Bloom filter type

# build a concise, searchable summary of the known members
members = BloomFilter(data)

# choose 1000 values known not to be in the members
for i in range(1000):
    candidate = random.randrange(2 ** 32)
    while candidate in members:
        candidate = random.randrange(2 ** 32)
    print(candidate)
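
Since the question targets C, a minimal illustrative version of the same idea might look like the following. The table size, the two multiplicative hash constants, and random_u32 are all placeholder choices for the sketch, not tuned recommendations; a 2^27-bit table with two hashes gives a false-positive rate under 10% for 20 million entries, which only costs the occasional extra retry:

#include <stddef.h>
#include <stdint.h>

extern uint32_t random_u32(void);    /* hypothetical: any 32-bit RNG */

#define BLOOM_BITS (1u << 27)        /* 2^27 bits = 16 MB */

static uint8_t bloom[BLOOM_BITS / 8];

/* Two cheap multiplicative hashes; illustrative only. */
static uint32_t h1(uint32_t x) { return (x * 2654435761u) & (BLOOM_BITS - 1); }
static uint32_t h2(uint32_t x) { return (x * 2246822519u + 3266489917u) & (BLOOM_BITS - 1); }

static void bloom_add(uint32_t x)
{
    bloom[h1(x) >> 3] |= (uint8_t)(1u << (h1(x) & 7));
    bloom[h2(x) >> 3] |= (uint8_t)(1u << (h2(x) & 7));
}

/* May report a false positive (harmless: we just retry), but never a
 * false negative (so a duplicate can never be returned). */
static int bloom_maybe_contains(uint32_t x)
{
    return ((bloom[h1(x) >> 3] >> (h1(x) & 7)) & 1)
        && ((bloom[h2(x) >> 3] >> (h2(x) & 7)) & 1);
}

/* Build the summary once, then reject candidates it might contain. */
uint32_t pick_unique(const uint32_t *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        bloom_add(data[i]);

    uint32_t candidate;
    do {
        candidate = random_u32();
    } while (bloom_maybe_contains(candidate));
    return candidate;
}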

Upvotes: 2

Persixty

Reputation: 8589

A good solution is to do as you say (pick, and re-pick on collision), but keep the excluded numbers in a hash table.

If your array of excluded numbers is itself well distributed, you don't even need a hash function.

Make sure to store collisions in sorted order, which reduces the search effort by approximately half (a lookup can stop as soon as it passes the point where the value would be). Ideally put bucket collisions in a binary tree, but a sorted linked list may work at your scale.

It's good because it's very tunable by adjusting the number of buckets.

It's O(N) to build the hash table, and each pick after that is expected O(1). It's asymptotically O(N^2) on average if you're picking 'without replacement' and adding each value you find to the 'exclude' list at each step. However, the constant on N^2 will probably be manageably small at your scale (see the sketch below).
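
A sketch of the pick-and-retest loop in C, using a simple open-addressing table. The table size, hash multiplier, and random_u32 are placeholder choices, it skips the sorted-bucket refinement above, and the 0xFFFFFFFF empty marker assumes that value never occurs in the data (a real implementation would need a separate emptiness flag):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

extern uint32_t random_u32(void);    /* hypothetical: any 32-bit RNG */

#define EMPTY      0xFFFFFFFFu       /* assumed absent from the data */
#define TABLE_SIZE (1u << 26)        /* 64M slots for ~20M keys */

static uint32_t table[TABLE_SIZE];   /* 256 MB; size/allocate to taste */

static uint32_t slot_of(uint32_t x) { return (x * 2654435761u) & (TABLE_SIZE - 1); }

static void set_insert(uint32_t x)
{
    uint32_t i = slot_of(x);
    while (table[i] != EMPTY && table[i] != x)
        i = (i + 1) & (TABLE_SIZE - 1);   /* linear probing */
    table[i] = x;
}

static int set_contains(uint32_t x)
{
    for (uint32_t i = slot_of(x); table[i] != EMPTY; i = (i + 1) & (TABLE_SIZE - 1))
        if (table[i] == x)
            return 1;
    return 0;
}

/* O(N) to build; each pick is then expected O(1), and only about one
 * candidate in 2000 even needs a retry at 2,000,000 exclusions. */
uint32_t pick_unique(const uint32_t *arr, size_t n)
{
    memset(table, 0xFF, sizeof table);   /* all slots to EMPTY */
    for (size_t i = 0; i < n; i++)
        set_insert(arr[i]);

    uint32_t c;
    do {
        c = random_u32();
    } while (set_contains(c));
    return c;
}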

Notice that picking a random 32-bit value has around a 1-in-2000 chance of 'hitting' an 'exclude' list of 2,000,000 entries.

If the exclude list were denser (K approaching 2^32 - 1), you would instead pick a random number in the range [0, 2^32 - 1 - K] and then count up through the excluded values to land in the right gap. But your figures easily pass any test for the exclusions being small in comparison to the pool size.

If you don't care too much about statistical accuracy, you can just jump in and add 1 whenever you hit an 'exclude'.

If you're going to produce accurate statistics in some simulation or cryptographic application, don't do the +1 bodge. If you're programming a game, or just looking for a healthy spread in (say) an automated testing suite, I would expect it to be fine. Notice that the 'clumping' is proportionate to the 'exclude' density.

Upvotes: 1

Patricia Shanahan

Reputation: 26185

Most of the cost of scanning a large array is in the memory access. You can substantially reduce the risk of rescanning by picking a small set of candidate random numbers.

During the scan, compare each member of the set to the current array element. Drop the set member if it matches. If that makes the set empty, you have to go back and begin again with a new set. If you get to the end with a non-empty candidate set, pick any surviving member.
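
A sketch of this in C, with the candidate count and random_u32 as placeholder choices:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

extern uint32_t random_u32(void);   /* hypothetical: any 32-bit RNG */

#define NCAND 16    /* arbitrary; a handful makes total failure vanishingly rare */

/* One pass over the array; the inner loop over candidates is cheap
 * compared to the memory fetch of arr[i]. Returns false only if every
 * candidate collided, in which case the caller retries with a new set. */
bool pick_unique(const uint32_t *arr, size_t n, uint32_t *out)
{
    uint32_t cand[NCAND];
    bool alive[NCAND];
    for (size_t j = 0; j < NCAND; j++) {
        cand[j] = random_u32();
        alive[j] = true;
    }

    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < NCAND; j++)
            if (alive[j] && cand[j] == arr[i])
                alive[j] = false;   /* drop matching candidates */

    for (size_t j = 0; j < NCAND; j++)
        if (alive[j]) {
            *out = cand[j];
            return true;
        }
    return false;                   /* rescan with a fresh set */
}

With 16 candidates against 20 million entries, the chance that all of them collide is roughly (2 * 10^7 / 2^32)^16, effectively zero, so the rescan branch almost never executes.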

Upvotes: 2
