Reputation: 9322
I am having a hard time formulating my question so I'll just show by example.
x = ['abc', 'c', 'w', 't', '3']
a, b = random_split(x, 3) # first list should be length 3
# e.g. a => ['abc', 'w', 't']
# e.g. b => ['c', '3']
Is there an easy way of splitting a list into two random samples while maintaining the original ordering?
Edit: I know that I could use random.sample and then reorder, but I was hoping for an easy, simple, one line method.
Edit 2: Here's another solution, see if you can improve it:
def random_split(l, a_size):
a, b = [], []
m = len(l)
which = ([a] * a_size) + ([b] * (m - a_size))
random.shuffle(which)
for array, sample in zip(which, l):
array.append(sample)
return a, b
Edit 3: My concern in avoiding sorting was that in the best case scenario it is O(N*log(N))
. It should be possible to get a function that scales O(N)
Unfortunately, none of the solutions posted so far actually achieve O(N)
Though, after a little thought I found one that works and is comparable to @PedroWerneck's answer in performance. Though, I'm not 100% sure that is truly random.
def random_split(items, size):
n = len(items)
a, b = [], []
for item in items:
if size > 0 and random.random() < float(size)/n:
b.append(item)
size -= 1
else:
a.append(item)
n -= 1
return a, b
Upvotes: 3
Views: 4742
Reputation: 3417
How about this approach. Random sample from the indexes and return two lists from two list comprehensions if in and if not in:
def random_split(lst, size):
import random
samp = set(random.sample(xrange(len(lst)),size))
return ([v for i,v in enumerate(lst) if i in samp],
[v for i,v in enumerate(lst) if i not in samp])
x = ['abc', 'c', 'w', 't', '3']
print random_split(x,3)
returns
(['abc', 't', '3'], ['c', 'w']) #random and retains order
Upvotes: 4
Reputation: 150977
Ok, there have been lots of interesting suggestions, one of which I inadvertently duplicated in a previous version of this post. But here are two solutions that have not been presented in this exact form:
def random_split(seq, n):
indices = set(random.sample(range(len(seq)), n))
left_right = ([], [])
for n, x in enumerate(seq):
left_right[n not in indices].append(x)
return left_right
This does just one pass through the list and produces a uniformly random partition of the list, maintaining order. It's a refinement of hexparrot's suggestion, which was the one I inadvertently duplicated. You could use the ternary operator and two separate lists, but this seems a tad cleaner to me. Using enumerate
allows this to handle non-hashable items, as well as sequences with duplicates.
def random_split(seq, n):
rnd_bools = random.sample((0,) * n + (1,) * (len(seq) - n), len(seq))
left_right = ([], [])
for b, x in zip(rnd_bools, seq):
left_right[b].append(x)
return left_right
This one feels right to me. It's a refinement of Jacob Eggers second edit to the question. It's not very different, but instead of shuffling a list of lists, it shuffles a list of bools. I think it's a tad more comprehensible at first glance. It avoids the 2-line shuffle by using random.sample
, which generates a copy; some may prefer the 2-line shuffle, and it's easily replaced.
Note that both of these work on the same basic principle: generate a sequence of bools and use them to index a left_right
tuple; the first could easily be made almost identical to the second by pre-generating the boolean list.
Finally, the second solution can be converted into a profoundly ugly "one-liner" that I do not recommend -- obviously -- but that I display here for your amusement and ridicule:
random_split = lambda seq, n: reduce(lambda a, b: (a[0] + ([b[1]] if not b[0] else []), a[1] + ([b[1]] if b[0] else [])), zip(random.sample((0,) * n + (1,) * (len(seq) - n), len(seq)), seq), ([], []))
Upvotes: 3
Reputation: 41898
I believe it's impossible to do the limiting and no sorting after splitting while keeping the randomness in a simpler way than just sampling and reordering.
If there was no limit, it would be as random as the RNG can by by iterating over the list, and choosing randomly which destination list to send the values to:
>>> import random
>>> x = range(20)
>>> a = []
>>> b = []
>>> for v in x:
... random.choice((a, b)).append(v)
...
>>> a
[0, 2, 3, 4, 6, 7, 10, 12, 15, 17]
>>> b
[1, 5, 8, 9, 11, 13, 14, 16, 18, 19]
If you can deal with some bias, you can stop appending to the first list when it reaches the limit and still use the solution above. If you'll deal with small lists like in your example, it shouldn't be a big deal to retry it until you get the first list length right.
If you want it to be really random and be able to limit the first list size, then you'll have to give up and reorder at least one of the lists. The closest to a one liner implementation I can think is something like:
>>> x = range(20)
>>> b = x[:]
>>> a = sorted([b.pop(b.index(random.choice(b))) for n in xrange(limit)])
>>> a
[0, 1, 5, 10, 15, 16, 17]
>>> b
[2, 3, 4, 6, 7, 8, 9, 11, 12, 13, 14, 18, 19]
You have to sort a, but b has the order kept.
edit
Now, do you really have to avoid reordering at all costs? Many neat solutions were posted, and your second solution is very nice, but none of them is simpler, easier and shorter than:
def random_split(items, size):
sample = set(random.sample(items, size))
return sorted(sample), sorted(set(items) - sample)
Even considering both sorting operations, I think it's hard to beat that one for simplicity and efficiency. Consider how optimized Python's Timsort is and how most other methods have to iterate over the n items at least once for each list.
If you really must avoid reordering, I guess this one also works and is very easy and simple, but iterates twice:
def random_split(items, size):
sample = set(random.sample(items, size))
a = [x for x in items if x in sample]
b = [x for x in items if x not in sample]
return a, b
This is essentially the same as Hexparrot's solution with the set(sample) suggested by senderle to make comparisons O(1), and removing the redundant index sample and enumerate calls. You don't need that if you deal only with hashable objects.
Upvotes: 4
Reputation: 24788
Here's something in a few lines:
from random import sample
x = ['abc', 'c', 'w', 't', '3']
sample_size = len(x) // 2
sample_set = set(sample(x, sample_size))
split_list = [[x[i] for i in subset] for subset in (sorted(sample_set), sorted(set(x) - sample_set))]
Upvotes: 0
Reputation: 114481
A variation on the shuffle-sort theme...
def random_split(L, size):
index = range(len(L))
random.shuffle(index)
return ([L[i] for i in sorted(index[:size])],
[L[i] for i in sorted(index[size:])])
Upvotes: 1
Reputation: 7955
I'm taking a guess that your random_split isn't supposed to give repeat elements.
If you don't have any duplicates in the original list, this will work as a one-liner as you have it in the original post, but it uses sorting. It's a very simple, if inefficient method of doing it:
import random
x = ['abc', 'c', 'w', 't', '3']
def random_split(x, n):
k = x[:]
random.shuffle(k)
yield sorted(k[:n], key = x.index)
yield sorted(k[n:], key = x.index)
a, b = random_split(x, 3)
Example of results:
>>> a
['c', 'w', 't']
>>> b
['abc', '3']
Upvotes: 0
Reputation: 88378
Here is a transcript you can turn into a function:
>>> a = [10,20,30,40,50,60]
>>> keep = sorted(random.sample(range(len(a)),3))
>>> keep
[0, 3, 4]
>>> ([a[i] for i in keep], [a[i] for i in range(len(a)) if i not in keep])
([10, 40, 50], [20, 30, 60])
Upvotes: 1