timxyz

Reputation: 418

Generating random sublist from ordered list that maintains ordering

Consider the problem of selecting a random sublist Y of k items from X, a list of n items, such that the items in Y appear in the same order as they do in X. The selected items need not be distinct. One solution is this:

for i = 1 to k
    A[i] = floor(rand * n) + 1
    Y[i] = X[A[i]]
sort Y according to the ordering of A
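
For concreteness, a minimal runnable Python version of that sketch might look like this (the function name is mine, and it uses 0-based indices rather than the pseudocode's 1-based ones):

from random import randrange

def ordered_sample_with_sort(X, k):
    # Draw k indices uniformly with replacement, then sort them so that
    # Y inherits the ordering of X. The sort makes this O(k log k).
    A = sorted(randrange(len(X)) for _ in range(k))
    return [X[a] for a in A]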

However, this has running time O(k log k) due to the sort operation. To avoid the sort, it is tempting to do the following:

high_index = n
for i = 1 to k
    index = floor(rand * high_index) + 1
    Y[k - i + 1] = X[index]
    high_index = index

But this gives a clear bias to the returned list due to the uniform index selection. It feels like an O(k) solution is attainable if the indices in the second solution were drawn non-uniformly. Does anyone know whether this is the case, and if so, what properties the distribution from which the marginal indices are drawn would have?

Upvotes: 8

Views: 1803

Answers (5)

wildplasser

Reputation: 44250

The original list X has n items. There are 2**n possible sublists, since every item either appears in a sublist or does not: each item adds a bit to the enumeration of the possible sublists. You can view this enumeration as a bit word of n bits.

Since you only want sublists with k items, you are interested in bit words with exactly k bits set. A practical algorithm could pick (or not pick) the first element of X, and then recurse into the rightmost n-1 items of X, keeping track of the accumulated number of chosen items. Since the X list is processed in order, the Y list will also be in order.

#include <stdio.h>
#include <string.h>

unsigned pick_k_from_n(char target[], char src[],
                       unsigned k, unsigned n, unsigned done)
{
    unsigned count = 0;

    if (k > n) return 0;        /* not enough items left to pick k */

    if (k == 0) {               /* a complete sublist: emit it */
        target[done] = 0;
        puts(target);
        return 1;
    }
    if (n > 0) {
        /* either skip the current item ... */
        count += pick_k_from_n(target, src + 1, k, n - 1, done);

        /* ... or take it and recurse into the remaining n-1 items */
        target[done] = *src;
        count += pick_k_from_n(target, src + 1, k - 1, n - 1, done + 1);
    }

    return count;
}

int main(int argc, char **argv)
{
    char result[20];
    char *domain = "OmgWtf!";
    unsigned cnt, len, want;
    want = 3;

    switch (argc) {             /* fall-through is intentional */
    default:
    case 3:
        domain = argv[2];
    case 2:
        sscanf(argv[1], "%u", &want);
    case 1:
        break;
    }
    len = strlen(domain);

    cnt = pick_k_from_n(result, domain, want, len, 0);

    fprintf(stderr, "Count=%u\n", cnt);

    return 0;
}

Removing the recursion is left as an exercise for the reader. Some output:

plasser@pisbak:~/hiero/src$ ./a.out 3 ABBA
BBA
ABA
ABA
ABB
Count=4
plasser@pisbak:~/hiero/src$

Upvotes: 0

timxyz

Reputation: 418

For the first index in Y, the distribution of indices in X is given by:

P(x; n, k) = binomial(n - x + k - 2, k - 1) / norm

where binomial denotes the binomial coefficient, and norm is a normalisation factor equal to the total number of possible sublist configurations:

norm = binomial(n + k - 1, k)

So for k = 5 and n = 10 we have:

  • norm = 2002
  • P(x = 0) = 0.357, P(x <= 0) = 0.357
  • P(x = 1) = 0.247, P(x <= 1) = 0.604
  • P(x = 2) = 0.165, P(x <= 2) = 0.769
  • P(x = 3) = 0.105, P(x <= 3) = 0.874
  • P(x = 4) = 0.063, P(x <= 4) = 0.937
  • ... (and so on, up to x = 9)

We can sample the X index of the first item in Y from this distribution (call it x1). The distribution of the second index in Y can then be sampled in the same way with P(x; (n - x1), (k - 1)), and so on for all subsequent indices.
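
As an illustration only (not an O(k) algorithm), here is a rough Python sketch of this sequential scheme; the function name is mine, and it inverts the CDF by a linear scan, so each draw costs O(n):

from math import comb
from random import random

def sample_ordered_indices(n, k):
    # Draw the k (0-based) indices one at a time from
    # P(x; n, k) = binomial(n - x + k - 2, k - 1) / binomial(n + k - 1, k),
    # then shrink the range and decrement k as described above.
    indices, base = [], 0
    while k > 0:
        norm = comb(n + k - 1, k)
        u, cum = random(), 0.0
        for x in range(n):
            cum += comb(n - x + k - 2, k - 1) / norm
            if u <= cum:
                break
        indices.append(base + x)
        base += x  # later indices may not precede this one
        n -= x     # the remaining range shrinks to n - x values
        k -= 1
    return indices

The linear scan per draw is the sticking point for an O(k) bound.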

My feeling now is that the problem is not solvable in O(k), because in general we cannot sample from the distribution described above in constant time. If k = 2 we can solve it in constant time using the quadratic formula (because the probability function simplifies to 0.5(x^2 + x)), but I can't see a way to extend this to all k (my maths isn't great, though).

Upvotes: 0

oldboy

Reputation: 514

By your first algorithm, it suffices to generate k uniform random samples of [0, 1) in sorted order.

Let X1, ..., Xk be these samples. Given that Xk = x, the conditional distribution of X1, ..., Xk-1 is k - 1 uniform random samples of [0, x) in sorted order, so it suffices to sample Xk and recurse.

What's the probability that Xk < x? Each of k independent samples of [0, 1) must be less than x, so the answer (the cumulative distribution function for Xk) is x^k. To sample according to the cdf, all we have to do is invert it on a uniform random sample of [0, 1): pow(random(), 1.0 / k).
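
A minimal sketch of that recursion (the function name is mine): generate the maximum first, then rescale the range and repeat, giving k sorted uniform samples in O(k):

from random import random

def sorted_uniforms(k):
    # Fill out[] from the right: the maximum of i + 1 uniforms on [0, upper)
    # has CDF (x / upper)^(i + 1), so invert that on a fresh uniform draw.
    out, upper = [0.0] * k, 1.0
    for i in range(k - 1, -1, -1):
        upper *= random() ** (1.0 / (i + 1))
        out[i] = upper
    return out

Mapping each sample u to X[floor(u * n)] then yields the ordered selection.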


Here's an (expected) O(k) algorithm I actually would consider implementing. The idea is to dump the samples into k bins, sort each bin, and concatenate. Here's some untested Python:

from random import randrange

def samples(n, k):
    bins = [[] for i in range(k)]
    for i in range(k):
        x = randrange(n)
        bins[(x * k) // n].append(x)  # bucket j holds values in [j*n/k, (j+1)*n/k)
    result = []
    for bin in bins:
        bin.sort()                    # each bin has expected O(1) size
        result.extend(bin)
    return result

Why is this efficient in expectation? Suppose we use insertion sort on each bin (each bin has expected size O(1)!). On top of operations that are O(k), we pay proportionally to the sum of the squares of the bin sizes, which is essentially the number of collisions. Since the probability of two samples colliding is at most something like 4/k and we have O(k^2) pairs of samples, the expected number of collisions is O(k).

I suspect rather strongly that the O(k) guarantee can be made with high probability.

Upvotes: 1

Ivaylo Strandjev

Reputation: 71009

You can use counting sort to sort Y and thus make the sorting linear with respect to k. However, for that you need one additional array of length n. If we assume you have already allocated it, you may execute the code you are asking for arbitrarily many times with complexity O(k).

The idea is just as you describe, but I will use one more array cnt of size n that I assume is initialized to 0, and another "stack" st that I assume is empty.

for i = 1 to k
    A[i] = floor(rand * n) + 1
    cnt[A[i]] += 1
    if cnt[A[i]] == 1  // remember each distinct index once, to traverse faster
        st.push(A[i])

for elem in st
    for i = 1 to cnt[elem]
        Y.add(X[elem])

for elem in st
    cnt[elem] = 0

EDIT: as oldboy points out, what I state above is not true - I would still have to sort st, which might be a bit better than the original proposition but not by much. So this approach is only good when k is comparable to n; then we can just iterate through cnt linearly and construct Y that way. This way st is not needed:

for i = 1 to k
    A[i] = floor(rand * n) + 1
    cnt[A[i]] += 1

for i = 1 to n
    for j = 1 to cnt[i]
        Y.add(X[i])
    cnt[i] = 0

Upvotes: 0

amit

Reputation: 178521

An unbiased O(n + k) solution is trivial; high-level pseudo code:

  • create an empty histogram of size n [initialized with all elements as zeros]
  • populate it with k uniformly distributed draws in range (do histogram[inclusiveRand(1, n)]++ k times)
  • iterate over the initial list [A], decrementing entries in the histogram and appending elements to the result list.

Explanation [edit]:

  • The idea is to choose k elements out of n at random, with uniform distribution for each, and create a histogram of them.
  • This histogram now contains, for each index i, how many times A[i] will appear in the resulting Y list.
  • Now iterate the list A in order, and for each element i insert A[i] into the resulting Y list histogram[i] times (see the sketch after this list).
  • This guarantees the order is maintained, because you insert elements in order and "never go back".
  • It also guarantees an unbiased solution, since for each i, j, K: P(histogram[i] = K) = P(histogram[j] = K), so for each K, every element has the same probability of appearing in the resulting list K times.
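
A minimal Python sketch of the histogram approach described above (the function name is mine, and it uses 0-based indexing instead of inclusiveRand(1, n)):

from random import randrange

def ordered_sample(X, k):
    n = len(X)
    histogram = [0] * n
    for _ in range(k):        # k uniform draws, with replacement
        histogram[randrange(n)] += 1
    Y = []
    for i in range(n):        # one in-order pass over X preserves its order
        Y.extend([X[i]] * histogram[i])
    return Y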

I believe it can be done in O(k) using the order statistics [X(i)], but I cannot figure it out. :\

Upvotes: 1
