dontloo
dontloo

Reputation: 10865

what is the fastest way to randomly select some numbers in a range that contains a given set of numbers?

Here is the function I want

random_select(contain_list, ttl_num, sample_num)

There're ttl_num integers from 0 to ttl_num-1 to select from, I want to return a list of sample_num unique integers, where numbers provided in contain_list have to be in the list, and other numbers are selected randomly.

I have to do this query very often, every time with a different contain_list, but ttl_num, sample_num are the same for all the queries.

Currently what I'm doing is, first generate a set of ttl_num integers, subtract contain_list from the set, randomly select some numbers with no replacement from the rest, then concatenate it with the contain_list to get the result.

I believe this is not the fastest way, any better thoughts?

It's okay to use global variables if needed.

Edit:
sample_num is no less than contain_list length and I want to get contain_list plus sample_num - contain_list.length other random numbers
it's guaranteed that numbers in contain_list are all within the range 0 to ttl_num-1.

Upvotes: 0

Views: 179

Answers (2)

dontloo
dontloo

Reputation: 10865

I just wrote some code similar to method 1 from James Droscha's answer in a vectorized way using numpy, which turned out to be just several lines of code,

def random_select(batch, ttl_num, sample_num):
    # add the following line if elements in batch are not guaranteed to be unique
    # batch = np.unique(batch)
    batch_size = len(batch)
    # step 1
    candidates = np.arange(ttl_num)
    # step 4
    candidates[batch] = candidates[-batch_size:]  # so that elements in candidates[:ttl_num-batch_size] are not contained in batch
    # step 5
    idx = np.random.choice(ttl_num-batch_size, sample_num-batch_size, replace=False)
    return np.concatenate([candidates[idx], batch])

Upvotes: 1

James Droscha
James Droscha

Reputation: 181

Here are a couple of possibilities. Neither is less complex than what you already have, but one or both of them could turn out to be faster, depending on the size of the parameter values. Only benchmarking with your actual data would tell for sure.

Method 1

The logic here is essentially the same as what you are already doing. It just replaces the set generation and operations with an integer array, which should be lighter weight. It does, however, require a sort (descending) on contain_list, so whether or not it actually runs faster than what you already have probably depends on the size of contain_list.count and ttl_num.

1) initialize a tracking var, remaining_num = ttl_num

2) initialize an integer array with value = index

3) sort contain_list descending

4) iterate through contain_list (now in descending order); for each:
4.1) decrement remaining_num
4.2) swap the element at the selected index with the one at index = remaining_num

5) iterate (sample_num - contain_list.count) times; for each:
5.1) generate a random index between 0 and remaining_num (inclusive and exclusive, respectively)
5.2) decrement remaining_num
5.3) swap the element at the selected index with the one at index = remaining_num

6) The resultant samples will start at index reamining_num and run through the end of the array.

Here is an example run for random_select({3, 7}, 10, 5)...

remaining_num = 10

available_num[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

contain_list = {7, 3}

select the 7
remaining_num = 9
available_num[] = {0, 1, 2, 3, 4, 5, 6, 9, 8, 7}

select the 3
remaining_num = 8
available_num[] = {0, 1, 2, 8, 4, 5, 6, 9, 3, 7}

select a random(0,8), e.g. 2
remaining_num = 7
available_num[] = {0, 1, 9, 8, 4, 5, 6, 2, 3, 7}

select a random(0,7), e.g. 3
remaining_num = 6
available_num[] = {0, 1, 9, 6, 4, 5, 8, 2, 3, 7}

select a random(0,6), e.g. 0
remaining_num = 5
available_num[] = {5, 1, 9, 6, 4, 0, 8, 2, 3, 7}

result = {0, 8, 2, 3, 7}

Method 2

If ttl_num is sufficiently large and sample_num is sufficiently low, it might be worth turning things upside down. That is, instead of creating and manipulating a set of available numbers, track only a list of selected numbers. Then, when selecting each random target, "skip" previously selected numbers by iterating through the list of selected numbers and counting how are less than or equal to the target index.

1) initialize a tracking var, remaining_num = ttl_num - contain_list.count

2) declare an empty list (vector) of integers, selected_num[]

4) iterate through contain_list; for each:
4.1) insert cointain_list[i] into selected_num[]

5) iterate (sample_num - contain_list.count) times; for each:
5.1) generate a random target between 0 and remaining_num (inclusive and exclusive, respectively)
5.2) decrement remaining_num
5.3) iterate through selected_num; for each:
5.3.1) if target >= selected_list[j], increment target
5.4) insert target into selected_num[]

6) The resultant samples will be all elements in selected_num.

Here is an example run for random_select({3, 7}, 10, 5)...

remaining_num = 8

selected_num[] = {}

select the 3
selected_num[] = {3}

select the 7
selected_num[] = {3, 7}

select a random(0,8), e.g. target = 2
remaining_num = 7
2 < 3; target still 2
2 < 7; target still 2
selected_num[] = {3, 7, 2}

select a random(0,7), e.g. target = 3
remaining_num = 6
3 >= 3; target becomes 4
4 < 7; target still 4
4 >= 2; target becomes 5
selected_num[] = {3, 7, 2, 5}

select a random(0,6), e.g. target = 0
remaining_num = 5
0 < 3; target still 0
0 < 7; target still 0
0 < 2; target still 0
0 < 5; target still 0
selected_num[] = {3, 7, 2, 5, 0}

Obviously, iterating through selected_num[] when selecting each new number could get expensive if sample_num is large. This could be alleviated somewhat by maintaining selected_num[] in descending sort order and breaking the inner loop as soon as you see a number smaller than the target. Insert the target at that point in the list to maintain the sort.

Upvotes: 1

Related Questions