Jacob Eggers
Jacob Eggers

Reputation: 9322

Random split list, keeping the original order in new lists

I am having a hard time formulating my question so I'll just show by example.

x = ['abc', 'c', 'w', 't', '3']
a, b = random_split(x, 3)      # first list should be length 3
# e.g. a => ['abc', 'w', 't']
# e.g. b => ['c', '3']

Is there an easy way of splitting a list into two random samples while maintaining the original ordering?


Edit: I know that I could use random.sample and then reorder, but I was hoping for an easy, simple, one line method.

Edit 2: Here's another solution, see if you can improve it:

def random_split(l, a_size):
    a, b = [], []
    m = len(l)
    which = ([a] * a_size) + ([b] * (m - a_size)) 
    random.shuffle(which)

    for array, sample in zip(which, l):
        array.append(sample)

    return a, b

Edit 3: My concern in avoiding sorting was that in the best case scenario it is O(N*log(N)). It should be possible to get a function that scales O(N) Unfortunately, none of the solutions posted so far actually achieve O(N) Though, after a little thought I found one that works and is comparable to @PedroWerneck's answer in performance. Though, I'm not 100% sure that is truly random.

def random_split(items, size):
  n = len(items)
  a, b = [], []
  for item in items:
    if size > 0 and random.random() < float(size)/n:
      b.append(item)
      size -= 1
    else:
      a.append(item)

    n -= 1

  return a, b

Upvotes: 3

Views: 4742

Answers (7)

hexparrot
hexparrot

Reputation: 3417

How about this approach. Random sample from the indexes and return two lists from two list comprehensions if in and if not in:

def random_split(lst, size):
    import random
    samp = set(random.sample(xrange(len(lst)),size))
    return ([v for i,v in enumerate(lst) if i in samp],
            [v for i,v in enumerate(lst) if i not in samp])

x = ['abc', 'c', 'w', 't', '3']

print random_split(x,3)

returns

(['abc', 't', '3'], ['c', 'w']) #random and retains order

Upvotes: 4

senderle
senderle

Reputation: 150977

Ok, there have been lots of interesting suggestions, one of which I inadvertently duplicated in a previous version of this post. But here are two solutions that have not been presented in this exact form:

def random_split(seq, n):
    indices = set(random.sample(range(len(seq)), n))
    left_right = ([], [])
    for n, x in enumerate(seq):
        left_right[n not in indices].append(x)
    return left_right

This does just one pass through the list and produces a uniformly random partition of the list, maintaining order. It's a refinement of hexparrot's suggestion, which was the one I inadvertently duplicated. You could use the ternary operator and two separate lists, but this seems a tad cleaner to me. Using enumerate allows this to handle non-hashable items, as well as sequences with duplicates.

def random_split(seq, n):
    rnd_bools = random.sample((0,) * n + (1,) * (len(seq) - n), len(seq))
    left_right = ([], [])
    for b, x in zip(rnd_bools, seq):
        left_right[b].append(x)
    return left_right

This one feels right to me. It's a refinement of Jacob Eggers second edit to the question. It's not very different, but instead of shuffling a list of lists, it shuffles a list of bools. I think it's a tad more comprehensible at first glance. It avoids the 2-line shuffle by using random.sample, which generates a copy; some may prefer the 2-line shuffle, and it's easily replaced.

Note that both of these work on the same basic principle: generate a sequence of bools and use them to index a left_right tuple; the first could easily be made almost identical to the second by pre-generating the boolean list.

Finally, the second solution can be converted into a profoundly ugly "one-liner" that I do not recommend -- obviously -- but that I display here for your amusement and ridicule:

random_split = lambda seq, n: reduce(lambda a, b: (a[0] + ([b[1]] if not b[0] else []), a[1] + ([b[1]] if b[0] else [])), zip(random.sample((0,) * n + (1,) * (len(seq) - n), len(seq)), seq), ([], []))

Upvotes: 3

Pedro Werneck
Pedro Werneck

Reputation: 41898

I believe it's impossible to do the limiting and no sorting after splitting while keeping the randomness in a simpler way than just sampling and reordering.

If there was no limit, it would be as random as the RNG can by by iterating over the list, and choosing randomly which destination list to send the values to:

>>> import random
>>> x = range(20)
>>> a = []
>>> b = []
>>> for v in x:
...     random.choice((a, b)).append(v)
... 
>>> a
[0, 2, 3, 4, 6, 7, 10, 12, 15, 17]
>>> b
[1, 5, 8, 9, 11, 13, 14, 16, 18, 19]

If you can deal with some bias, you can stop appending to the first list when it reaches the limit and still use the solution above. If you'll deal with small lists like in your example, it shouldn't be a big deal to retry it until you get the first list length right.

If you want it to be really random and be able to limit the first list size, then you'll have to give up and reorder at least one of the lists. The closest to a one liner implementation I can think is something like:

>>> x = range(20)
>>> b = x[:]
>>> a = sorted([b.pop(b.index(random.choice(b))) for n in xrange(limit)])
>>> a
[0, 1, 5, 10, 15, 16, 17]
>>> b
[2, 3, 4, 6, 7, 8, 9, 11, 12, 13, 14, 18, 19]

You have to sort a, but b has the order kept.

edit

Now, do you really have to avoid reordering at all costs? Many neat solutions were posted, and your second solution is very nice, but none of them is simpler, easier and shorter than:

def random_split(items, size):
    sample = set(random.sample(items, size))
    return sorted(sample), sorted(set(items) - sample)

Even considering both sorting operations, I think it's hard to beat that one for simplicity and efficiency. Consider how optimized Python's Timsort is and how most other methods have to iterate over the n items at least once for each list.

If you really must avoid reordering, I guess this one also works and is very easy and simple, but iterates twice:

def random_split(items, size):
    sample = set(random.sample(items, size))
    a = [x for x in items if x in sample]
    b = [x for x in items if x not in sample]
    return a, b

This is essentially the same as Hexparrot's solution with the set(sample) suggested by senderle to make comparisons O(1), and removing the redundant index sample and enumerate calls. You don't need that if you deal only with hashable objects.

Upvotes: 4

Joel Cornett
Joel Cornett

Reputation: 24788

Here's something in a few lines:

from random import sample
x = ['abc', 'c', 'w', 't', '3']
sample_size = len(x) // 2

sample_set = set(sample(x, sample_size))
split_list = [[x[i] for i in subset] for subset in (sorted(sample_set), sorted(set(x) - sample_set))]

Upvotes: 0

6502
6502

Reputation: 114481

A variation on the shuffle-sort theme...

def random_split(L, size):
    index = range(len(L))
    random.shuffle(index)
    return ([L[i] for i in sorted(index[:size])],
            [L[i] for i in sorted(index[size:])])

Upvotes: 1

Casey Kuball
Casey Kuball

Reputation: 7955

I'm taking a guess that your random_split isn't supposed to give repeat elements.

If you don't have any duplicates in the original list, this will work as a one-liner as you have it in the original post, but it uses sorting. It's a very simple, if inefficient method of doing it:

import random

x = ['abc', 'c', 'w', 't', '3']

def random_split(x, n):
    k = x[:]
    random.shuffle(k)
    yield sorted(k[:n], key = x.index)
    yield sorted(k[n:], key = x.index)

a, b = random_split(x, 3)

Example of results:

>>> a
['c', 'w', 't']
>>> b
['abc', '3']

Upvotes: 0

Ray Toal
Ray Toal

Reputation: 88378

Here is a transcript you can turn into a function:

>>> a = [10,20,30,40,50,60]
>>> keep = sorted(random.sample(range(len(a)),3))
>>> keep
[0, 3, 4]
>>> ([a[i] for i in keep], [a[i] for i in range(len(a)) if i not in keep])
([10, 40, 50], [20, 30, 60])

Upvotes: 1

Related Questions