Sampling intervals, not numbers, without replacement

Question

The sort of problem I am dealing with involves a few things, namely:

I need to randomly sample numbers from a range of numbers.
That range of numbers is really huge, as from 1 to 1,000,000,000.
I need the sampling process to avoid sampling from intervals within the range that have already been sampled. Since using an array is too slow, my attempts to use splice are not going to work.

I start by picking a number between 1 and 1,000,000,000.

my $random = int(rand(1_000_000_000)) + 1;

I add a value, say 100, to that to make $random and $random + 100 define an interval.

my $interval = $random + 100;

Then I push both $random and $interval into another array. This other array is to store the intervals.

push ( @rememberOldIntervals, $random, $interval );

I step through array @rememberOldIntervals using a for loop, pulling out items in pairs. The first of a pair is a former $random and the other a $interval. Inside this for loop, I do another random number generation. But the number generated can't be between an interval already taken. If so, keep sampling until a number is found that is unique. Further, this new random number must be at least 100 away from any old interval.

for ( my $i= 0; $i < (scalar @rememberOldIntervals) / 2 ; $i=+2) {
      $random = int(rand(1_000_000_000)) + 1;
      my $new_random_low  = $random - 100;
      my $new_random_high = $random + 100;

      if ( $new_random_low  <= $rememberOldIntervals[0] OR 
           $new_random_high >= $rememberOldIntervals[1]    ){

          push( @rememberOldIntervals, $new_random_low, $new_random_high ); 
      }

      else {
            until ($new_random_low  <= $rememberOldIntervals[0] OR 
                   $new_random_high >= $rememberOldIntervals[1]    ) {

                   $random = int(rand(1_000_000_000)) + 1;
                   my $new_random_low  = $random - 100;
                   my $new_random_high = $random + 100;
            }
      }

}

This latter loop would need to be embedded within another to drive it many times, say 10,000 times.

Miller · Accepted Answer

This problem can be reframed into pulling 10,000 random numbers between 0 and 1 billion, where no number is within 100 of another.

Brute Force - 5 secs

Because you're only pulling 10,000 numbers, and probably don't need to do it very often, I suggest approaching this type of problem using brute force initially. This is trying to follow the design pattern of Premature optimization is the root of all evil

In this case, that means just pulling random numbers and comparing them to all previously pulled numbers. This will have a speed of O(N^2), but will also take less code.

use strict;
use warnings;

my $max = 1_000_000_000;
my $dist = 100;
my $count = 10_000;

die "Too many numbers" if 2 * $dist * $count >= $max;

my @numbers;

while (@numbers < $count) {
    my $num = int rand $max;
    push @numbers, $num if ! grep {abs($num - $_) < $dist} @numbers;
}

print scalar(@numbers), "
";

Output takes 5 seconds:

Binary Search for faster generation - 0.14 secs

Now for faster algorithm, I agree with ysth that a much more efficient method to solve this is to create two lists of your random numbers. One of them is the running list, and the other is sorted. Use the sorted list to do a binary search for placement and then comparison to its nearby elements to see if it is within 100.

This reduces the number of comparisons from O(N^2) to O(N log N). The following takes just 0.14 seconds to run versus the 5 seconds of the brute force method.

use strict;
use warnings;

my $max = 1_000_000_000;
my $dist = 100;
my $count = 10_000;

die "Too many numbers" if 2 * $dist * $count >= $max;

my @numbers;
my @sorted = (-$dist, $max);   # Include edges to simplify binary search logic.

while (@numbers < $count) {
    my $num = int rand $max;

    # Binary Search of Sorted list.
    my $binary_min = 0;
    my $binary_max = $#sorted;
    while ($binary_max > $binary_min) {
        my $average = int( ($binary_max + $binary_min) / 2 );
        $binary_max = $average if $sorted[$average] >= $num;
        $binary_min = $average + 1 if $sorted[$average] <= $num;
    }

    if (! grep {abs($num - $_) < $dist} @sorted[$binary_max, $binary_max - 1]) {
        splice @sorted, $binary_max, 0, $num;
        push @numbers, $num;
    }
}

print scalar(@numbers), "
";

Hash of quotients for fastest - 0.05 secs

I inquired in the comments: "Could you simplify this problem to pick a random multiple of 100? That would ensure no overlap, and then you'd just need to pick a random number from 1 to 10 million without repeat, and then just multiply it by 100." You didn't respond, but we can still use grouping by multiples of 100 to simplify this problem.

Basically, if we keep track of a number's quotient divided by 100, we only need it to compare it to numbers with quotients plus and minus one. This reduces the number of comparisons to O(N), which not surprisingly is the fastest at 0.05 seconds:

use strict;
use warnings;

my $max = 1_000_000_000;
my $dist = 100;
my $count = 10_000;

die "Too many numbers" if 2 * $dist * $count >= $max;

my @numbers;
my %num_per_quot;

while (@numbers < $count) {
    my $num = int rand $max;

    my $quotient = int $num / $dist;

    if (! grep {defined && abs($num - $_) < $dist} map {$num_per_quot{$quotient + $_}} (-1, 0, 1)) {
        push @numbers, $num;
        $num_per_quot{$quotient} = $num;
    }
}

print scalar(@numbers), "
";

Caution if you're on Windows

If you run this code on Windows and are using a version of perl less than v5.20, you'll need to use a better random number generate than the built-in rand. For reasons why, read avoid using rand if it matters.

I used Math::Random::MT qw(rand); in this code since I'm on Strawberry Perl v5.18.2. However, starting with Perl v5.20 this will no longer be a concern because rand now uses a consistent random number generator.

Sampling intervals, not numbers, without replacement

Answers (2)

Related Questions