ded
ded

Reputation: 430

Python - building sublist that meets certain criteria from huge number of combinations

Been reading a long time, first time I have not been able to find an answer to something I'm working on.

I have a list of 93 strings which are each 6 characters long. From those 93 strings, I want to identify a set of 20 which all meet a particular criteria relative to the others in the set. While itertools.combinations will give me all possible combinations, not all conditions are worth checking.

For instance if [list[0], list[1], etc] fails because list[0] and list[1] can not be together it doesn't matter what the other 18 strings are, the set will fail every time, and that is a ton of wasted checking.

Currently I have this working with 20 nested for loops, but it seems like there has to be a better/faster way to do it.:

for n1 in bclist:
    building = [n1]
    n2bclist = [bc for bc in bclist if bc not in building]
    for n2 in n2bclist:              #this is the start of what gets repeated 19 times
        building.append(n2)
        if test_function(building): #does set fail? (counter intuitive, True when fail, False when pass)
            building.remove(n2)
            continue
        n3bclist = [bc for bc in bclist if bc not in building]
        #insert the additional 19 for loops, with n3 in n3, n4 in n4, etc
        building.remove(n2)

There are print statements in the 20th for loop to alert me if a set of 20 even exists. The for statements at least allow me to skip sets early when the single addition fails, but there is no memory of when larger combinations fail:

For instance [list[0], list[1]] fails, so skip to [list[0], [list[2]] which passes. Next is [list[0], list[2], list[1]] which will fail because 0 and 1 are together again so it will move to [list[0], list[2], list[3]] which may or not pass. My concern is that eventually it will also test:

All of these combinations will have the same outcome as the previous ones. Basically I trade the devil of itertools.combinations testing all combinations of sets which I know fail because of early values which fail for the devil of for loops which treat order of values as a factor when I do not care about their order. Both methods significantly increase the time it will take for my code to complete.

Any ideas on how to get rid of the devils would be greatly appreciated.

Upvotes: 3

Views: 746

Answers (4)

ecatmur
ecatmur

Reputation: 157484

You should be basing your solution on itertools.combinations as that will take care of the ordering issue; short-circuiting filtering is relatively easy to solve.

Recursive solution

Let's quickly review how to implement combinations works; the easiest way is to take the nested-for-loop approach and convert it to recursive style:

def combinations(iterable, r):
    pool = tuple(iterable)
    for i in range(0, len(pool)):
        for j in range(i + 1, len(pool)):
            ...
                yield (i, j, ...)

Converted to recursive form:

def combinations(iterable, r):
    pool = tuple(iterable)
    def inner(start, k, acc):
        if k == r:
            yield acc
        else:
            for i in range(start, len(pool)):
                for t in inner(i + 1, k + 1, acc + (pool[i], )):
                    yield t
    return inner(0, 0, ())

Applying a filter is now easy:

def combinations_filterfalse(predicate, iterable, r):
    pool = tuple(iterable)
    def inner(start, k, acc):
        if predicate(acc):
            return
        elif k == r:
            yield acc
        else:
            for i in range(start, len(pool)):
                for t in inner(i + 1, k + 1, acc + (pool[i], )):
                    yield t
    return inner(0, 0, ())

Let's check this:

>>> list(combinations_filterfalse(lambda t: sum(t) % 2 == 1, range(5), 2))
[(0, 2), (0, 4), (2, 4)]

Iterative solution

The actual implementation of itertools.combinations listed in the documentation uses an iterative loop:

def combinations(iterable, r):
    pool = tuple(iterable)
    n = len(pool)
    if r > n:
        return
    indices = range(r)
    yield tuple(pool[i] for i in indices)
    while True:
        for i in reversed(range(r)):
            if indices[i] != i + n - r:
                break
        else:
            return
        indices[i] += 1
        for j in range(i+1, r):
            indices[j] = indices[j-1] + 1
        yield tuple(pool[i] for i in indices)

In order to fit in a predicate elegantly, it's necessary to reorder the loop slightly:

def combinations_filterfalse(predicate, iterable, r):
    pool = tuple(iterable)
    n = len(pool)
    if r > n or predicate(()):
        return
    elif r == 0:
        yield ()
        return
    indices, i = range(r), 0
    while True:
        while indices[i] + r <= i + n:
            t = tuple(pool[k] for k in indices[:i+1])
            if predicate(t):
                indices[i] += 1
            elif len(t) == r:
                yield t
                indices[i] += 1
            else:
                indices[i+1] = indices[i] + 1
                i += 1
        if i == 0:
            return
        i -= 1
        indices[i] += 1

Checking again:

>>> list(combinations_filterfalse(lambda t: sum(t) % 2 == 1, range(5), 2))
[(0, 2), (0, 4), (2, 4)]
>>> list(combinations_filterfalse(lambda t: t == (1, 4), range(5), 2))
[(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
>>> list(combinations_filterfalse(lambda t: t[-1] == 3, range(5), 2))
[(0, 1), (0, 2), (0, 4), (1, 2), (1, 4), (2, 4)]
>>> list(combinations_filterfalse(lambda t: False, range(5), 2))
[(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
>>> list(combinations_filterfalse(lambda t: False, range(5), 0))
[()]

Comparison

It turns out that the recursive solution is not only simpler but is also faster:

In [33]: timeit list(combinations_filterfalse_rec(lambda t: False, range(20), 5))
10 loops, best of 3: 24.6 ms per loop

In [34]: timeit list(combinations_filterfalse_it(lambda t: False, range(20), 5))
10 loops, best of 3: 76.6 ms per loop

Upvotes: 0

Stuart
Stuart

Reputation: 9868

You can make use of two aspects of the problem:

  1. order doesn't matter
  2. if test_function(L) is True then test_function of any sub-list of L will also be True

You can also simplify things a bit by dealing with the indices 0-92 instead of list[0]-list[92] - it's only within test_function that we might care what the contents of the list are.

The following code does that by first finding viable pairs, then sets of four, sets of eight, and sets of sixteen. Finally it finds all viable combinations of a sixteen and a four to get your list of 20. However there were over 100,000 sets of eight and so it was still way too slow and I gave up. Possibly you can do something along the same lines but speed it up with itertools, but probably not enough.

target = range(5, 25)
def test_function(L):
    for i in L:
        if not i in target:
            return True
def possible_combos(A, B):
    """
    Find all possible pairings of a list within A and a list within B
    """
    R = []
    for i in A:
        for j in B:
            if i[-1] < j[0] and not test_function(i + j):
                R.append(i + j)
    return R
def possible_doubles(A):
    """
    Find all possible pairings of two lists within A
    """
    R = []
    for n, i in enumerate(A):
        for j in A[n + 1:]:
            if i[-1] < j[0] and not test_function(i + j):
                R.append(i + j)
    return R
# First, find all pairs that are okay
L = range(92) 
pairs = []
for i in L:
    for j in L[i + 1:]:
        if not test_function([i, j]):
            pairs.append([i, j])

# Then, all pairs of pairs
quads = possible_doubles(pairs)
print "fours", len(quads), quads[0]
# Then all sets of eight, and sixteen
eights = possible_doubles(quads)
print "eights", len(eights), eights[0]
sixteens = possible_doubles(eights)
print "sixteens", len(sixteens), sixteens[0]

# Finally check all possible combinations of a sixteen plus a four
possible_solutions = possible_combos(sixteens, fours)
print len(possible_solutions), possible_solutions[0]

EDIT: I found a much better solution. First, identify all pairs of values within the range (0-92) that conform with test_function, keeping the pairs in order. Presumably the 1st value of the 1st pair must be the 1st value of the solution, and the 2nd value of the last pair must be the last value of the solution (but check... is this assumption true for test_function? If this isn't a safe assumption then you will need to repeat find_paths for all the possible values of the start and finish). Then find a path from the 1st to last value that is 20 values long and also conforms to test_function.

def test_function(S):
    for i in S:
        if not i in target:
            return True
    return False

def find_paths(p, f):
    """ Find paths from end of p to f, check they are the right length,
        and check they conform to test_function
    """
    successful = []
    if p[-1] in pairs_dict:
        for n in pairs_dict[p[-1]]:
            p2 = p + [n]
            if (n == f and len(p2) == target_length and
                not test_function(p2)):
                successful.append(p2)
            else:
                successful += find_paths(p2, f)
    return successful

list_length = 93              # this is the number of possible elements
target = [i * 2 for i in range(5, 25)] 
    # ^ this is the unknown target list we're aiming for...
target_length = len(target)   # ... we only know its length
L = range(list_length - 1)
pairs = []
for i in L:
    for j in L[i + 1:]:
        if not test_function([i, j]):
            pairs.append([i, j])
firsts = [a for a, b in pairs]
nexts = [[b for a, b in pairs if a == f] for f in firsts]
pairs_dict = dict(zip(firsts, nexts))
print "Found solution(s):", find_paths([pairs[0][0]], pairs[-1][1])

Upvotes: 0

NPE
NPE

Reputation: 500883

def gen(l, n, test, prefix=()):
  if n == 0:
    yield prefix
  else:
    for i, el in enumerate(l):
      if not test(prefix + (el,)):
        for sub in gen(l[i+1:], n - 1, test, prefix + (el,)):
          yield sub

def test(l):
  return sum(l) % 3 == 0 # just a random example for testing

print list(gen(range(5), 3, test))

This will choose subsets of cardinality n from l such that test(subset) == False.

It tries to avoid unnecessary work. However, given that there are 1e20 ways to choose 20 elements out of 93, you may need to rethink your overall approach.

Upvotes: 1

Andrew Clark
Andrew Clark

Reputation: 208655

Use your current method, but keep track of indices as well so that in your inner loops you can skip elements you would have already checked:

bcenum = list(enumerate(bclist))
for i1, n1 in bcenum:
    building = [n1]
    for i2, n2 in bcenum[i1+1:]:              #this is the start of what gets repeated 19 times
        building.append(n2)
        if test_function(building): #does set fail? (counter intuitive, True when fail, False when pass)
            building.remove(n2)
            continue
        for i3, n3 in bcenum[i2+1:]:
            # more nested loops
        building.remove(n2)

Upvotes: 1

Related Questions