jackzellweger
jackzellweger

Reputation: 378

Random access over all pair-wise combinations of large list in Python

Background:

I have a list of 44906 items: large = [1, 60, 17, ...]. I also have a personal computer with limited memory (8GB), running Ubuntu 14.04.4 LTS.

The Goal:

I need to find all the pair-wise combinations of large in a memory-efficient manner, without filling a list with all the combinations beforehand.

The Problem & What I've Tried So Far:

When I use itertools.combinations(large, 2), and try to assign it to a list my memory fills up immediately, and I get very slow performance. The reason for this is that the number of pairwise combinations goes like n*(n-1)/2 where n is the number of elements of the list.

The number of combinations for n=44906 comes out to 44906*44905/2 = 1008251965. A list with this many entries is much too large to store in memory. I would like to be able to design a function so that I can plug in a number i to find the ith pair-wise combination of numbers in this list, and a way to somehow dynamically compute this combination, without referring to a 1008251965 element list that's impossible to store in memory.

An Example of What I Am Trying To Do:

Let's say I have an array small = [1,2,3,4,5]

In the configuration in which I have the code, itertools.combinations(small, 2) will return a list of tuples as such:

[(1, 2), # 1st entry
 (1, 3), # 2nd entry
 (1, 4), # 3rd entry
 (1, 5), # 4th entry
 (2, 3), # 5th entry
 (2, 4), # 6th entry 
 (2, 5), # 7th entry
 (3, 4), # 8th entry
 (3, 5), # 9th entry
 (4, 5)] # 10th entry

A call to a the function like this: `find_pair(10)' would return:

(4, 5)

, giving the 10th entry in the would-be array, but without calculating the entire combinatorial explosion beforehand.

The thing is, I need to be able to drop in to the middle of the combinations, not starting from the beginning every time, which is what it seems like an iterator does:

>>> from itertools import combinations
>>> it = combinations([1, 2, 3, 4, 5], 2)
>>> next(it)
(1, 2)
>>> next(it)
(1, 3)
>>> next(it)
(1, 4)
>>> next(it)
(1, 5)

So, instead of having to execute next() 10 times to get to the 10th combination, I would like to be able to retrieve the tuple returned by the 10th iteration with one call.

The Question

Are there any other combinatorial functions that behave this way designed to deal with huge data sets? If not, is there a good way to implement a memory-saving algorithm that behaves this way?

Upvotes: 4

Views: 1711

Answers (5)

karakfa
karakfa

Reputation: 67497

If you want random access to any combination you can use this function to return the index of a corresponding lower triangular representation of the cross-product

def comb(k):         
        row=int((math.sqrt(1+8*k)+1)/2)    
        column=int(k-(row-1)*(row)/2)  
        return [row,column]

using your small array for example

small = [1,2,3,4,5]
length = len(small)
size = int(length * (length-1)/2)
for i in range(size):
    [n,m] = comb(i)
    print(i,[n,m],"(",small[n],",",small[m],")")

will give

0 [1, 0] ( 2 , 1 )
1 [2, 0] ( 3 , 1 )
2 [2, 1] ( 3 , 2 )
3 [3, 0] ( 4 , 1 )
4 [3, 1] ( 4 , 2 )
5 [3, 2] ( 4 , 3 )
6 [4, 0] ( 5 , 1 )
7 [4, 1] ( 5 , 2 )
8 [4, 2] ( 5 , 3 )
9 [4, 3] ( 5 , 4 )

obviously if your access method is in order other methods will be more practical.

Note also that the comb function is independent of the size of the problem.

As suggested by @Blckknght in the comments to get the same order as itertools version change to

for i in range(size):
        [n,m] = comb(size-1-i) 
        print(i,[n,m],"(",small[length-1-n],",",small[length-1-m],")")  


0 [4, 3] ( 1 , 2 )
1 [4, 2] ( 1 , 3 )
2 [4, 1] ( 1 , 4 )
3 [4, 0] ( 1 , 5 )
4 [3, 2] ( 2 , 3 )
5 [3, 1] ( 2 , 4 )
6 [3, 0] ( 2 , 5 )
7 [2, 1] ( 3 , 4 )
8 [2, 0] ( 3 , 5 )
9 [1, 0] ( 4 , 5 )

Upvotes: 4

Prune
Prune

Reputation: 77837

I started with that triangular arrangement, finding the subscript k for list members indexed row and col. Then I reversed the process, deriving row and col from k.

For a list large of N items, let

b = 2*N - 1

Now, to get the kth combination in the list ...

row = (b - math.sqrt(b*b - 8*k)) // 2
col = k - (2*N - row + 1)*row / 2
kth_pair = large[row][col]

This allows you to access any member of the combinations list without ever generating that list.

Upvotes: 3

Łukasz Rogalski
Łukasz Rogalski

Reputation: 23213

For well defined order of created pairs, indices of first and second elements should be related to n and length of a sequence. If you'll find them, you'll be able to achieve const-time performance, since indexing lists is O(1) operation.

Pseudocode would look like this:

def find_nth_pair(seq, n):
    idx1 = f1(n, len(seq))  # some formula of n and len(seq)
    idx2 = f2(n, len(seq))  # some formula of n and len(seq)
    return (seq[idx1], seq[idx2])

You only need to find formulas for idx1 and idx2.

Upvotes: 1

kcborys
kcborys

Reputation: 316

So you've got 44906 items. Notice, however, that if you build your combinations the same way you build them in the example then there are 44905 combinations with large[0] as the first number. Furthermore, combination i for i <= 44905 looks like (large[0], large[i]).

For 44905 < i <= 89809, it looks like (large[1],large[i-44904]).

If I'm not mistaken, this pattern should continue with something like (large[j],large[i-(exclusive lower bound for j)+1]). You can check my math on that but I'm pretty sure it's right. Anyways, you could iterate to find these lower bounds (so for j=0, it's 0, for j=1, it's 44905, etc.) Iterating should be easy because you just add the next descending number: 44905, 44905+44904, 44905+44904+44903...

Upvotes: 1

Tim Peters
Tim Peters

Reputation: 70592

Except itertools.combinations does not return a list - it returns an iterator. Here:

>>> from itertools import combinations
>>> it = combinations([1, 2, 3, 4, 5], 2)
>>> next(it)
(1, 2)
>>> next(it)
(1, 3)
>>> next(it)
(1, 4)
>>> next(it)
(1, 5)
>>> next(it)
(2, 3)
>>> next(it)
(2, 4)

and so on. It's extremely memory-efficient: only one pair is produced per invocation.

Of course it is possible to write a function that returns the n'th result, but before bothering with that (which will be slower and more involved), are you quite sure you can't just use combinations() the way it was designed to be used (i.e., iterating over it, instead of forcing it to produce a giant list)?

Upvotes: 5

Related Questions