Reputation: 1262

Python: Finding random k-subset partition for a given list

The following code generates all partitions of length k (k-subset partitions) for a given list. the algorithm could be found in this topic.

def algorithm_u(ns, m):
    def visit(n, a):
        ps = [[] for i in xrange(m)]
        for j in xrange(n):
            ps[a[j + 1]].append(ns[j])
        return ps

    def f(mu, nu, sigma, n, a):
        if mu == 2:
            yield visit(n, a)
        else:
            for v in f(mu - 1, nu - 1, (mu + sigma) % 2, n, a):
                yield v
        if nu == mu + 1:
            a[mu] = mu - 1
            yield visit(n, a)
            while a[nu] > 0:
                a[nu] = a[nu] - 1
                yield visit(n, a)
        elif nu > mu + 1:
            if (mu + sigma) % 2 == 1:
                a[nu - 1] = mu - 1
            else:
                a[mu] = mu - 1
            if (a[nu] + sigma) % 2 == 1:
                for v in b(mu, nu - 1, 0, n, a):
                    yield v
            else:
                for v in f(mu, nu - 1, 0, n, a):
                    yield v
            while a[nu] > 0:
                a[nu] = a[nu] - 1
                if (a[nu] + sigma) % 2 == 1:
                    for v in b(mu, nu - 1, 0, n, a):
                        yield v
                else:
                    for v in f(mu, nu - 1, 0, n, a):
                        yield v

    def b(mu, nu, sigma, n, a):
        if nu == mu + 1:
            while a[nu] < mu - 1:
                yield visit(n, a)
                a[nu] = a[nu] + 1
            yield visit(n, a)
            a[mu] = 0
        elif nu > mu + 1:
            if (a[nu] + sigma) % 2 == 1:
                for v in f(mu, nu - 1, 0, n, a):
                    yield v
            else:
                for v in b(mu, nu - 1, 0, n, a):
                    yield v
            while a[nu] < mu - 1:
                a[nu] = a[nu] + 1
                if (a[nu] + sigma) % 2 == 1:
                    for v in f(mu, nu - 1, 0, n, a):
                        yield v
                else:
                    for v in b(mu, nu - 1, 0, n, a):
                        yield v
            if (mu + sigma) % 2 == 1:
                a[nu - 1] = 0
            else:
                a[mu] = 0
        if mu == 2:
            yield visit(n, a)
        else:
            for v in b(mu - 1, nu - 1, (mu + sigma) % 2, n, a):
                yield v

    n = len(ns)
    a = [0] * (n + 1)
    for j in xrange(1, m + 1):
        a[n - m + j] = j - 1
    return f(m, n, 0, n, a)

we know that number of k-subsets of a given list is equal to Stirling number and it could be very big for some large lists.

the code above returns a Python generator that could generate all possible k-subset partitions for the given list with calling its next method. accordingly, if I want to get only one of these partitions randomly, I have to either call next method for some random times (which makes it really slow if the Stirling number is big) or use the itertools.islice method to get a slice of size one which is really slow as before.

I'm trying to avoid listing all partitions because it would be waste of time and speed and even memory (because calculations are a lot and memory is important in my case).

the question is how can I generate only one of k-subset partitions without generating the rest ? or at least make the procedure very faster than what explained above. I need the performance because I need to get only one of them each time and I'm running the application for maybe more than ten million times.

I'd appreciate any help.

EDIT: EXAMPLE

list : { 1, 2, 3 }

for k = 3:

{ {1}, {2}, {3} }

for k = 2:

{ {1, 2}, {3} }
{ {1, 3}, {2} }
{ {1}, {2, 3} }

and for k = 1:

{ {1, 2, 3} }

consider k = 2, is there any way I can generate only one of these 3 partitions randomly, without generating the other 2? note that I want to generate random partition for any given k not only a random partition of any k which means if I set the k to 2 I would like to generate only one of these 3 not one of all 5.

Regards,

Mohammad

Upvotes: 20

Answers (3)

m69 ''snarky and unwelcoming''

Reputation: 12324

tl;dr:

The k-subset partitions for a given n and k can be divided into types, based on which elements are the first to go into as yet empty parts. Each of these types is represented by a bit pattern with n-1 bits of which k-1 are set. While the number of partitions is huge (given by the second Stirling number), the number of types is much smaller, e.g.:

n = 21, k = 8
number of partitions: S(21,8) = 132,511,015,347,084
number of types: (n-1 choose k-1) = 77,520

Calculating how many partitions there are of each type is simple, based on the position of the zeros in the bit pattern. If you make a list of all the types (by iterating over all n:k bit patterns) and keep a running total of the number of partitions, you can then use a binary search on this list to find the type of the partition with a given rank (in Log₂(n-1 choose k-1) steps; 17 for the example above), and then translate the bit pattern into a partition and calculate into which part each element goes (in n steps). Every part of this method can be done iteratively, requiring no recursion.

Here's a non-recursive solution. I've tried to roll my own, but it may (partially) overlap with Peter's answer or existing methods.

If you have a set of n elements, e.g. with n=8:

{a,b,c,d,e,f,g,h}

then k-subset partitions will take this shape, e.g. with k=5:

{a,e} {b,c,h} {d} {f} {g}

This partition can also be written as:

1,2,2,3,1,4,5,2

which lists which part each of the elements goes in. So this sequence of n digits with values from 1 to k represents a k-subset partition of n elements.

However, not all such sequences are valid partitions; every digit from 1 to k must be present, otherwise there would be empty parts:

1,2,2,3,1,3,5,2 → {a,e} {b,c,h} {d,f} {} {g}

Also, to avoid duplicates, digit x can only be used after digit x-1 has been used. So the first digit is always 1, the second can be at most 2, and so on. If in the example we use digits 4 and 5 before 3, we get duplicate partitions:

1,2,2,3,1,4,5,2 → {a,e} {b,c,h} {d} {f} {g}
1,2,2,4,1,5,3,2 → {a,e} {b,c,h} {g} {d} {f}

When you group the partitions based on when each part is first used, you get these types:

1,1,1,1,2,3,4,5               0001111    11111111    1          1
1,1,1,2,12,3,4,5              0010111    11112111    2          2
1,1,1,2,3,123,4,5             0011011    11111311    3          3
1,1,1,2,3,4,1234,5            0011101    11111141    4          4
1,1,1,2,3,4,5,12345           0011110    11111115    5          5
1,1,2,12,12,3,4,5             0100111    11122111    2*2        4
1,1,2,12,3,123,4,5            0101011    11121311    2*3        6
1,1,2,12,3,4,1234,5           0101101    11121141    2*4        8
1,1,2,12,3,4,5,12345          0101110    11121115    2*5       10
1,1,2,3,123,123,4,5           0110011    11113311    3*3        9
1,1,2,3,123,4,1234,5          0110101    11113141    3*4       12
1,1,2,3,123,4,5,12345         0110110    11113115    3*5       15
1,1,2,3,4,1234,1234,5         0111001    11111441    4*4       16
1,1,2,3,4,1234,5,12345        0111010    11111415    4*5       20
1,1,2,3,4,5,12345,12345       0111100    11111155    5*5       25
1,2,12,12,12,3,4,5            1000111    11222111    2*2*2      8
1,2,12,12,3,123,4,5           1001011    11221311    2*2*3     12
1,2,12,12,3,4,1234,5          1001101    11221141    2*2*4     16
1,2,12,12,3,4,5,12345         1001110    11221115    2*2*5     20
1,2,12,3,123,123,4,5          1010011    11213311    2*3*3     18
1,2,12,3,123,4,1234,5         1010101    11213141    2*3*4     24
1,2,12,3,123,4,5,12345        1010110    11213115    2*3*5     30
1,2,12,3,4,1234,1234,5        1011001    11211441    2*4*4     32
1,2,12,3,4,1234,5,12345       1011010    11211415    2*4*5     40
1,2,12,3,4,5,12345,12345      1011100    11211155    2*5*5     50
1,2,3,123,123,123,4,5         1100011    11133311    3*3*3     27
1,2,3,123,123,4,1234,5        1100101    11133141    3*3*4     36
1,2,3,123,123,4,5,12345       1100110    11133115    3*3*5     45
1,2,3,123,4,1234,1234,5       1101001    11131441    3*4*4     48
1,2,3,123,4,1234,5,12345      1101010    11131415    3*4*5     60
1,2,3,123,4,5,12345,12345     1101100    11131155    3*5*5     75
1,2,3,4,1234,1234,1234,5      1110001    11114441    4*4*4     64
1,2,3,4,1234,1234,5,12345     1110010    11114415    4*4*5     80
1,2,3,4,1234,5,12345,12345    1110100    11114155    4*5*5    100
1,2,3,4,5,12345,12345,12345   1111000    11111555    5*5*5    125    SUM = 1050

In the above diagram, a partition of the type:

1,2,12,3,123,4,1234,5

means that:

a goes into part 1
b goes into part 2
c goes into part 1 or 2
d goes into part 3
e goes into part 1, 2 or 3
f goes into part 4
g goes into part 1, 2, 3 or 4
h goes into part 5

So partitions of this type have a digit that can have 2 values, a digit that can have 3 values, and a digit that can have 4 values (this is indicated in the third column in the diagram above). So there are a total of 2 × 3 × 4 partitions of this type (as indicated in columns 4 and 5). The sum of these is of course the Stirling number: S(8,5) = 1050.

The second column in the diagram is another way of notating the type of the partition: after starting with 1, every digit is either a digit that has been used before, or a step up (i.e. the highest digit used so far + 1). If we represent these two options by 0 and 1, we get e.g.:

1,2,12,3,123,4,1234,5 → 1010101

where 1010101 means:

Start with 1
1 → step up to 2
0 → repeat 1 or 2
1 → step up to 3
0 → repeat 1, 2 or 3
1 → step up to 4
0 → repeat 1, 2, 3 or 4
1 → step up to 5

So every binary sequence with n-1 digits and k-1 ones represents a type of partition. We can calculate the number of partitions of a type by iterating over the digits from left to right, incrementing a factor when we find a one, and multiplying with the factor when we find a zero, e.g.:

1,2,12,3,123,4,1234,5 → 1010101
Start with product = 1, factor = 1
1 → increment factor: 2
0 → product × factor = 2
1 → increment factor: 3
0 → product × factor = 6
1 → increment factor: 4
0 → product × factor = 24
1 → increment factor: 5

And again for this example, we find that there are 24 partitions of this type. So, counting the partitions of each type can be done by iterating over all n-1-digit integers with k-1 digits set, using any method (e.g. Gosper's Hack):

0001111      1       1
0010111      2       3
0011011      3       6
0011101      4      10
0011110      5      15
0100111      4      19
0101011      6      25
0101101      8      33
0101110     10      43
0110011      9      52
0110101     12      64
0110110     15      79
0111001     16      95
0111010     20     115
0111100     25     140
1000111      8     148
1001011     12     160
1001101     16     176
1001110     20     196
1010011     18     214
1010101     24     238
1010110     30     268
1011001     32     300
1011010     40     340
1011100     50     390
1100011     27     417
1100101     36     453
1100110     45     498
1101001     48     546
1101010     60     606
1101100     75     681
1110001     64     745
1110010     80     825
1110100    100     925
1111000    125    1050

Finding a random partition then means choosing a number from 1 to S(n,k), going over the counts per partition type while keeping a running total (column 3 above), and picking the corresponding partition type, and then calculating the value of the repeated digits, e.g.:

S(8,5) = 1050
random pick: e.g. 333
type: 1011010 → 1,2,12,3,4,1234,5,12345
range: 301 - 340
variation: 333 - 301 = 32
digit options: 2, 4, 5
digit values: 20, 5, 1
variation: 32 = 1 × 20 + 2 × 5 + 2 × 1
digits: 1, 2, 2 (0-based) → 2, 3, 3 (1-based)
partition: 1,2,2,3,4,3,5,3

and the 333rd partition of 8 elements in 5 parts is:

1,2,2,3,4,3,5,3 → {a} {b,c} {d,f,h} {e} {g}

There are a number of options to turn this into code; if you store the n-1-digit numbers as a running total, you can do subsequent lookups using a binary search over the list, whose length is C(n-1,k-1), to reduce time complexity from O(C(n-1,k-1)) to O(Log₂(C(n-1,k-1))).

I've made a first test in JavaScript (sorry, I don't speak Python); it's not pretty but it demonstrates the method and is quite fast. The example is for the case n=21 and k=8; it creates the count table for 77,520 types of partitions, returns the total number of partitions 132,511,015,347,084 and then retrieves 10 randomly picked partitions within that range. On my computer this code returns a million randomly selected partitions in 3.7 seconds. (note: the code is zero-based, unlike the explanation above)

function kSubsetPartitions(n, k) { // Constructor
    this.types = [];
    this.count = [];
    this.total = 0;
    this.elems = n;

    var bits = (1 << k - 1) - 1, done = 1 << n - 1;
    do {
        this.total += variations(bits);
        this.types.push(bits);
        this.count.push(this.total);
    }
    while (!((bits = next(bits)) & done));

    function variations(bits) {
        var product = 1, factor = 1, mask = 1 << n - 2;
        while (mask) {
            if (bits & mask) ++factor;
            else product *= factor;
            mask >>= 1;
        }
        return product;
    }
    function next(a) { // Gosper's Hack
        var c = (a & -a), r = a + c;
        return (((r ^ a) >> 2) / c) | r;
    }
}
kSubsetPartitions.prototype.partition = function(rank) {
    var range = 1, type = binarySearch(this.count, rank);
    if (type) {
        rank -= this.count[type - 1];
        range = this.count[type] - this.count[type - 1];
    }
    return translate(this.types[type], this.elems, range, rank);

    // This translates the bit pattern format and creates the correct partition
    // for the given rank, using a letter format for demonstration purposes
    function translate(bits, len, range, rank) {
        var partition = [["A"]], part, max = 0, mask = 1 << len - 2;
        for (var i = 1; i < len; i++, mask >>= 1) {
            if (!(bits & mask)) {
                range /= (max + 1);
                part = Math.floor(rank / range);
                rank %= range;
            }
            else part = ++max;
            if (!partition[part]) partition[part] = "";
            partition[part] += String.fromCharCode(65 + i);
        }
        return partition.join(" / ");
    }
    function binarySearch(array, value) {
        var low = 0, mid, high = array.length - 1;
        while (high - low > 1) {
            mid = Math.ceil((high + low) / 2);
            if (value < array[mid]) high = mid;
            else low = mid;
        }
        return value < array[low] ? low : high;
    }
}

var ksp = new kSubsetPartitions(21, 8);
document.write("Number of k-subset partitions for n,k = 21,8 &rarr; " + 
    ksp.total.toLocaleString("en-US") + "<br>");
for (var tests = 10; tests; tests--) {
    var rnd = Math.floor(Math.random() * ksp.total);
    document.write("Partition " + rnd.toLocaleString("en-US", {minimumIntegerDigits: 
        15}) + " &rarr; " + ksp.partition(rnd) + "<br>");
}

It isn't really necessary to store the bit patterns for each partition type, because they can be recreated from their index (see e.g. the second algorithm in this answer). If you only store the running total of the number of variations per partition type, that halves the memory requirement.

This second code example in C++ stores only the counts, and returns the partition as an n-length array containing the part number for each element. Usage example at the end of the code. On my computer it creates the count list for n=40 and k=32 in 12 seconds and then returns 10 million partitions in 24 seconds.

Values of n can go up to 65 and k up to 64, but for some combinations the number of partitions will be greater than 2⁶⁴, which this code obviously can't handle. If you translate it into Python, there should be no such restrictions. (Note: enable zero check in binomial coefficient function if k=1.)

class kSubsetPartitions {

    std::vector <uint64_t> count;
    uint64_t total;
    uint8_t n;
    uint8_t k;

    public:

    kSubsetPartitions(uint8_t n, uint8_t k) {
        this->total = 0;
        this->n = n;
        this->k = k;
        uint64_t bits = ((uint64_t) 1 << k - 1) - 1;
        uint64_t types = choose(n - 1, k - 1);

        this->count.reserve(types);
        while (types--) {
            this->total += variations(bits);
            this->count.push_back(this->total);
            bits = next(bits);
        }
    }

    uint64_t range() {
        return this->total;
    }

    void partition(uint64_t rank, uint8_t *buffer) {
        uint64_t range = 1;
        uint64_t type = binarySearch(rank);

        if (type) {
            rank -= this->count[type - 1];
            range = this->count[type] - this->count[type - 1];
        }
        format(pattern(type), range, rank, buffer);
    }

    private:

    uint64_t pattern(uint64_t type) {
        uint64_t steps, bits = 0, mask = (uint64_t) 1 << this->n - 2;
        uint8_t ones = this->k - 1;

        for (uint8_t i = this->n - 1; i; i--, mask >>= 1) {
            if (i > ones) {
                steps = choose(i - 1, ones);
                if (type >= steps) {
                    type -= steps;
                    bits |= mask;
                    --ones;
                }
            }
            else bits |= mask;
        }
        return bits;
    }

    uint64_t choose(uint8_t x, uint8_t y) { // C(x,y) using Pascal's Triangle
        static std::vector <std::vector <uint64_t> > triangle;

        if (triangle.empty()) {
            triangle.resize(this->n);
            triangle[0].push_back(1);
            for (uint8_t i = 1; i < this->n; i++) {
                triangle[i].push_back(1);
                for (uint8_t j = 1; j < i; j++) {
                    triangle[i].push_back(triangle[i - 1][j - 1] + triangle[i - 1][j]);
                }
                triangle[i].push_back(1);
            }
        }
        return triangle[x][y];
    }

    void format(uint64_t bits, uint64_t range, uint64_t rank, uint8_t *buffer) {
        uint64_t mask = (uint64_t) 1 << this->n - 2;
        uint8_t max = 0, part;

        *buffer = 0;
        while (mask) {
            if (!(bits & mask)) {
                range /= (max + 1);
                part = rank / range;
                rank %= range;
            }
            else part = ++max;
            *(++buffer) = part;
            mask >>= 1;
        }
    }

    uint64_t binarySearch(uint64_t rank) {
        uint64_t low = 0, mid, high = this->count.size() - 1;

        while (high - low > 1) {
            mid = (high + low + 1) / 2;
            if (rank < this->count[mid]) high = mid;
            else low = mid;
        }
        return rank < this->count[low] ? low : high;
    }

    uint64_t variations(uint64_t bits) {
        uint64_t product = 1;
        uint64_t mask = (uint64_t) 1 << this->n - 2;
        uint8_t factor = 1;

        while (mask) {
            if (bits & mask) ++factor;
            else product *= factor;
            mask >>= 1;
        }
        return product;
    }

    uint64_t next(uint64_t a) { // Gosper's Hack
        // if (!a) return a;    // k=1 => a=0 => c=0 => division by zero!
        uint64_t c = (a & -a), r = a + c;
        return (((r ^ a) >> 2) / c) | r;
    }
};

// USAGE EXAMPLE:
// uint8_t buffer[40];
// kSubsetPartitions* ksp = new kSubsetPartitions(40, 32);
// uint64_t range = ksp->range();
// ksp->partition(any_integer_below_range, buffer);

Below is an overview of the values of n and k that result in more than 2⁶⁴ partitions, and cause overflow in the code above. Up to n=26, all values of k give valid results.

25:  -     26:  -     27: 8-13   28: 7-15   29: 6-17   30: 6-18   31: 5-20   32: 5-21
33: 5-22   34: 4-23   35: 4-25   36: 4-26   37: 4-27   38: 4-28   39: 4-29   40: 4-31
41: 4-32   42: 4-33   43: 3-34   44: 3-35   45: 3-36   46: 3-37   47: 3-38   48: 3-39
49: 3-40   50: 3-42   51: 3-43   52: 3-44   53: 3-45   54: 3-46   55: 3-47   56: 3-48
57: 3-49   58: 3-50   59: 3-51   60: 3-52   61: 3-53   62: 3-54   63: 3-55   64: 3-56
65: 3-57

A version which doesn't store the number of partitions per type is possible, and would require almost no memory. Looking up the partitions that correspond to randomly selected integers would be slower, but if the selection of integers was sorted, it could be even faster than the version which requires binary sort for every lookup.

You'd start with the first bit pattern, calculate the number of partitions of this type, see if the first integer(s) fall into this range, calculate their partitions, and then move on to the next bit pattern.

Upvotes: 6

wildwilhelm

Reputation: 5019

How about something like this:

import itertools
import random

def random_ksubset(ls, k):
    # we need to know the length of ls, so convert it into a list
    ls = list(ls)
    # sanity check
    if k < 1 or k > len(ls):
        return []
    # Create a list of length ls, where each element is the index of
    # the subset that the corresponding member of ls will be assigned
    # to.
    #
    # We require that this list contains k different values, so we
    # start by adding each possible different value.
    indices = list(range(k))
    # now we add random values from range(k) to indices to fill it up
    # to the length of ls
    indices.extend([random.choice(list(range(k))) for _ in range(len(ls) - k)])
    # shuffle the indices into a random order
    random.shuffle(indices)
    # construct and return the random subset: sort the elements by
    # which subset they will be assigned to, and group them into sets
    return [{x[1] for x in xs} for (_, xs) in
            itertools.groupby(sorted(zip(indices, ls)), lambda x: x[0])]

This produces random k-subset partitions like so:

>>> ls = {1,2,3}
>>> print(random_ksubset(ls, 2))
[set([1, 2]), set([3])]
>>> print(random_ksubset(ls, 2))
[set([1, 3]), set([2])]
>>> print(random_ksubset(ls, 2))
[set([1]), set([2, 3])]
>>> print(random_ksubset(ls, 2))
[set([1]), set([2, 3])]

This method satisfies OP's requirement of getting one randomly-generated partition, without enumerating all possible partitions. Memory complexity here is linear. Run-time complexity is O(N log N) due to the sort. I suppose it might be possible to get this down to linear, if that was important, using a more complicated method of constructing the return value.

As @Leon points out, this satisfies the requirements of his option 2 in trying to define the problem. What this won't do is deterministically generate partition #N (this is Leon's option 1, which would allow you to randomly pick an integer N and then retrieve the corresponding partition). Leon's clarification is important, because, to satisfy the spirit of the question, every possible partition of the collection should be generated with equal probability. On our toy problem, this is the case:

>>> from collections import Counter
>>> Counter(frozenset(map(frozenset, random_ksubset(ls, 2))) for _ in range(10000))
Counter({frozenset({frozenset({2, 3}), frozenset({1})}): 3392,
         frozenset({frozenset({1, 3}), frozenset({2})}): 3212,
         frozenset({frozenset({1, 2}), frozenset({3})}): 3396})

However. In general, this method does not generate each partition with equal probability. Consider:

>>> Counter(frozenset(map(frozenset, random_ksubset(range(4), 2)))
...         for _ in range(10000)).most_common()
[(frozenset({frozenset({1, 3}), frozenset({0, 2})}), 1671),
 (frozenset({frozenset({1, 2}), frozenset({0, 3})}), 1667),
 (frozenset({frozenset({2, 3}), frozenset({0, 1})}), 1642),
 (frozenset({frozenset({0, 2, 3}), frozenset({1})}), 1285),
 (frozenset({frozenset({2}), frozenset({0, 1, 3})}), 1254),
 (frozenset({frozenset({0, 1, 2}), frozenset({3})}), 1245),
 (frozenset({frozenset({1, 2, 3}), frozenset({0})}), 1236)]

We can see here that we are more likely to generate "more balanced" partitions (because there are more ways to construct these). The partitions that contain singleton sets are produced less frequently.

It seems that an efficient uniform sampling method over k-partitions of sets is sort of an unsolved research question (also see mathoverflow). Nijenhuis and Wilf give code for sampling from all partitions (Chapter 12), which could work with rejection testing, and @PeterdeRivaz's answer can also uniformly sample a k-partition. The drawback with both of these methods is that they require computing the Stirling numbers, which grow exponentially in n, and the algorithms are recursive, which I think will make them slow on large inputs. As you mention "millions" of partitions in your comment, I think that these approaches will only be tractable up to a certain input size.

A. Nijenhuis and H. Wilf. Combinatorial Algorithms for Computers and Calculators. Academic Press, Orlando FL, second edition, 1978.

Exploring Leon's option 1 might be interesting. Here's a rough procedure to deterministically produce a particular partition of a collection using @Amadan's suggestion of taking an integer value interpreted as a k-ary number. Note that not every integer value produces a valid k-subset partition (because we disallow empty subsets):

def amadan(ls, N, k):
    """
    Given a collection `ls` with length `b`, a value `k`, and a
    "partition number" `N` with 0 <= `N` < `k**b`, produce the Nth
    k-subset paritition of `ls`.
    """
    ls = list(ls)
    b = len(ls)
    if not 0 <= N < k**b: return None
    # produce the k-ary index vector from the number N
    index = []
    # iterate through each of the subsets
    for _ in range(b):
        index.append(N % k)
        N //= k
    # subsets cannot be empty
    if len(set(index)) != k: return None
    return frozenset(frozenset(x[1] for x in xs) for (_, xs) in
                     itertools.groupby(sorted(zip(index, ls)),
                                       lambda x:x[0]))

We can confirm that this generates the Stirling numbers properly:

>>> for i in [(4,1), (4,2), (4,3), (4,4), (5,1), (5,2), (5,3), (5,4), (5,5)]:
...     b,k = i
...     r = [amadan(range(b), N, k) for N in range(k**b)]
...     r = [x for x in r if x is not None]
...     print(i, len(set(r)))
(4, 1) 1
(4, 2) 7
(4, 3) 6
(4, 4) 1
(5, 1) 1
(5, 2) 15
(5, 3) 25
(5, 4) 10
(5, 5) 1

This may also be able to produce each possible partition with equal probability; I'm not quite sure. Here's a test case, where it works:

>>> b,k = 4,3
>>> r = [amadan(range(b), N, k) for N in range(k**b)]
>>> r = [x for x in r if x is not None]
>>> print(Counter([' '.join(sorted(''.join(map(str, x)) for x in p)) for p in r]))
Counter({'0 13 2': 6,
         '01 2 3': 6,
         '0 12 3': 6,
         '03 1 2': 6,
         '02 1 3': 6,
         '0 1 23': 6})

Another working case:

>>> b,k = 5,4
>>> r = [amadan(range(b), N, k) for N in range(k**b)]
>>> r = [x for x in r if x is not None]
>>> print(Counter([' '.join(sorted(''.join(map(str, x)) for x in p)) for p in r]))
Counter({'0 12 3 4': 24,
         '04 1 2 3': 24,
         '0 1 23 4': 24,
         '01 2 3 4': 24,
         '03 1 2 4': 24,
         '0 13 2 4': 24,
         '0 1 24 3': 24,
         '02 1 3 4': 24,
         '0 1 2 34': 24,
         '0 14 2 3': 24})

So, to wrap this up in a function:

def random_ksubset(ls, k):
    ls = list(ls)
    maxn = k**len(ls)-1
    rv = None
    while rv is None:
        rv = amadan(ls, random.randint(0, maxn), k)
    return rv

And then we can do:

>>> random_ksubset(range(3), 2)
frozenset({frozenset({2}), frozenset({0, 1})})
>>> random_ksubset(range(3), 2)
frozenset({frozenset({1, 2}), frozenset({0})})
>>> random_ksubset(range(3), 2)
frozenset({frozenset({1, 2}), frozenset({0})})
>>> random_ksubset(range(3), 2)
frozenset({frozenset({2}), frozenset({0, 1})})

Upvotes: 6

Peter de Rivaz

Reputation: 33509

You can count Stirling numbers efficiently with a recursive algorithm by storing previously computed values:

fact=[1]

def nCr(n,k):
    """Return number of ways of choosing k elements from n"""
    while len(fact)<=n:
        fact.append(fact[-1]*len(fact))
    return fact[n]/(fact[k]*fact[n-k])

cache = {}
def count_part(n,k):
    """Return number of ways of partitioning n items into k non-empty subsets"""
    if k==1:
        return 1
    key = n,k
    if key in cache:
        return cache[key]
    # The first element goes into the next partition
    # We can have up to y additional elements from the n-1 remaining
    # There will be n-1-y left over to partition into k-1 non-empty subsets
    # so n-1-y>=k-1
    # y<=n-k
    t = 0
    for y in range(0,n-k+1):
        t += count_part(n-1-y,k-1) * nCr(n-1,y)
    cache[key] = t
    return t

Once you know how many choices there are, you can adapt this recursive code to generate a particular partition:

def ith_subset(A,k,i):
    """Return ith k-subset of A"""
    # Choose first element x
    n = len(A)
    if n==k:
        return A
    if k==0:
        return []
    for x in range(n):
        # Find how many cases are possible with the first element being x
        # There will be n-x-1 left over, from which we choose k-1
        extra = nCr(n-x-1,k-1)
        if i<extra:
            break
        i -= extra
    return [A[x]] + ith_subset(A[x+1:],k-1,i)

def gen_part(A,k,i):
    """Return i^th k-partition of elements in A (zero-indexed) as list of lists"""
    if k==1:
        return [A]
    n=len(A)
    # First find appropriate value for y - the extra amount in this subset
    for y in range(0,n-k+1):
        extra = count_part(n-1-y,k-1) * nCr(n-1,y)
        if i<extra:
            break
        i -= extra
    # We count through the subsets, and for each subset we count through the partitions
    # Split i into a count for subsets and a count for the remaining partitions
    count_partition,count_subset = divmod(i,nCr(n-1,y))
    # Now find the i^th appropriate subset
    subset = [A[0]] + ith_subset(A[1:],y,count_subset)
    S=set(subset)
    return  [subset] + gen_part([a for a in A if a not in S],k-1,count_partition)

As an example, I've written a test program that produces different partitions of 4 numbers:

def test(A):
    n=len(A)
    for k in [1,2,3,4]:
        t = count_part(n,k)
        print k,t
        for i in range(t):
            print " ",i,gen_part(A,k,i)

test([1,2,3,4])

This code prints:

1 1
  0 [[1, 2, 3, 4]]
2 7
  0 [[1], [2, 3, 4]]
  1 [[1, 2], [3, 4]]
  2 [[1, 3], [2, 4]]
  3 [[1, 4], [2, 3]]
  4 [[1, 2, 3], [4]]
  5 [[1, 2, 4], [3]]
  6 [[1, 3, 4], [2]]
3 6
  0 [[1], [2], [3, 4]]
  1 [[1], [2, 3], [4]]
  2 [[1], [2, 4], [3]]
  3 [[1, 2], [3], [4]]
  4 [[1, 3], [2], [4]]
  5 [[1, 4], [2], [3]]
4 1
  0 [[1], [2], [3], [4]]

As another example, there are 10 million partitions of 1,2,3,..14 into 4 parts. This code can generate all partitions in 44 seconds with pypy.

There are 50,369,882,873,307,917,364,901 partitions of 1,2,3,...,40 into 4 parts. This code can generate 10 million of these in 120 seconds with pypy running on a single processor.

To tie things together, you can use this code to generate a single random partition of a list A into k non-empty subsets:

import random
def random_ksubset(A,k):
    i = random.randrange(0,count_part(len(A),k))
    return gen_part(A,k,i)

Upvotes: 11

Python: Finding random k-subset partition for a given list

Answers (3)

Related Questions