West
West

Reputation: 2570

Python: Fastest way to generate unique lists from a range of numbers

Im trying to generate 5,760,000 unique lists, i.e. no repeating elements in each list and no one list is the same. Each list will have 24 numbers, ranging from 1 to 10,000. Order matters so for example (1,2,3) and (3,2,1) will be considered unique.

I tried the following just to check if random.sample gives me unique lists out of the box but it has been going on forever:

import random
mainlist = []

for i in range(1,5760001):
    r=random.sample(range(1,10000),24)
    if r not in mainlist:
        mainlist.append(r)

How can I best solve this

Upvotes: 1

Views: 979

Answers (2)

tobias_k
tobias_k

Reputation: 82889

The main problem with your approach is that mainlist is a list, meaning that if r not in mainlist will take more and more time. Change it to a set (and the actual elements to tuple so they are hashable), then the lookup will be much faster. In fact, you can just drop the in check, as a set can not have duplicates in the first place. Also note that your for loop will just try 5760000 numbers, but does (theoretically1) not ensure that the list contains that many elements in the end, so better use while.

mainset = set()
while len(mainset) < 5760000:
    mainset.add(tuple(random.sample(range(1, 10001), 24)))

For 5760000 numbers, this will still take a while, but the complexity should be down from O(n²) to O(n). If the sample-size approaches the maximum number of possible combinations, this will also get dramatically slow as you get more and more clashes, but for your numbers this should not be a problem.1 In fact, you might just as well just use a list comprehension without any checks:2

mainset = [random.sample(range(1, 10001), 24) for _ in range(5760000)] 

Alternatively, you can generate all those combinations by combining itertools.combinations with permutations,3 but of course those will be massively too many2 in your case. (Seriously, do not run this with your numbers!)

>>> from itertools import combinations, permutations
>>> nums = range(5)
>>> [p for c in combinations(nums, 3) for p in permutations(c)]
[(0, 1, 2), (0, 2, 1), ... 57 more ... (4, 3, 2)]

1) In fact, the number of possible combinations is so immensely higher2 than 5.7M that you could just use a list and drop that in check and most probably still would have not a single clash.

2) Just to put it into perspective: There are (n over k) combinations, each having k! permutations. For n=10000 and k=24, this makes 1.56e72 combinations and 6.2e23 permutations per combination for a total of 9.72e95 possible elements. Comparing that to the meager 5.7e6 in your sample, the probability for a collision is practically zero.

3) Or just permutations with 2nd parameter.

Upvotes: 3

kcsquared
kcsquared

Reputation: 5409

If you're looking for any lists satisfying those criteria, this will give you a generator for the 5760000 lexicographically earliest unique lists of size 24; calling list() on mainlist will put the list into memory.

from itertools import permutations, islice
mainlist = islice(permutations(range(1, 10000), 24), 5760000)

For uniformly random lists, on the other hand, tobias's answer gives the solution. If speed or memory is a concern, using a generator, such as those given by islice or starmap, is recommended with large datasets.

mainlist = starmap(random.sample, repeat((range(1, 10000), 24), 5760000))

Upvotes: 1

Related Questions