Joylove

Reputation: 414

Python reduce a large list for speed in a for loop

I'm trying to take a list of 4 million entries and, rather than iterate over them all, reduce the list inside the for loop that is enumerating it as it goes along.

The reduction criteria are found inside the loop: some of the later my_huge_list elements contain a combination of 2 consecutive elements that allows them to be discarded immediately.

Here I'm going to remove sublists containing 1,2 or A,B from my_huge_list.

Please note I don't know in advance that 1,2 and A,B are illegal until I go into my for loop.

output_list = []

my_huge_list = [[0,1,2,3,4],[0,1,3,4],[0,1,2,3,4],[0,1,3,4],[0,1,2,4],[0,1,2,3,4],['A','B'],[0,1,3,'A','B'],[0,1,2,3,4],[0,1,3,4],[0,1,2,3,4],[0,1,3,4],[0,1,2,4]...] #to 4m assorted entries

for sublist in my_huge_list[:]:
    pair = None
    for item_index in range(len(sublist) - 1):  #Edit for Barmar: each item in sublist is actually an object with attributes about allowed neighbors.
        if sublist[item_index + 1] in sublist[item_index].attributes['excludes_neighbors_list']:
            pair = [sublist[item_index], sublist[item_index + 1]]  #TODO build a list of pairs

    if pair is not None:  #Don't want pair in any item of output_list
        my_huge_list = [x for x in my_huge_list if ', '.join(map(str, pair)) not in str(x)]  #This list comprehension's sole function is to reduce my_huge_list from a 4m item list to 1.7m items

    #if '1, 2' in str(sublist): #Don't want 1,2 in any item of output_list
        #my_huge_list = [x for x in my_huge_list if '1, 2' not in str(x)]  #This list comprehension's sole function is to reduce my_huge_list

    #elif 'A, B' in str(sublist): #Don't want A,B in any item of output_list
        #my_huge_list = [x for x in my_huge_list if 'A, B' not in str(x)]  #This list comprehension's sole function is to reduce my_huge_list from a 1.7m item list to 1.1m items

    else:
        output_list.append(sublist)


my_huge_list
>>>[[0,1,3,4],[0,1,3,4],[0,1,3,4],[0,1,3,4]...] 

So the 'for loop' unfortunately does not seem to get any faster: it is iterating over the slice copy my_huge_list[:], which still holds all 4m entries, so the loop keeps visiting every one of them even though my_huge_list itself was quickly reduced by the list comprehension.
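A minimal sketch (with made-up small data, not the real 4m-entry list) of why the slice copy defeats the reduction: my_huge_list[:] is copied once, before the loop starts, so rebinding the name inside the loop never shrinks what the loop is walking over.

items = [1, 2, 3, 4, 5]

visited = []
for x in items[:]:                           # items[:] is a shallow copy made once, up front
    visited.append(x)
    items = [v for v in items if v > 3]      # rebinds 'items'; the copy being iterated is untouched

print(visited)   # [1, 2, 3, 4, 5]  -- every original element is still visited
print(items)     # [4, 5]           -- the reduction happened, but too late to help the loop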

[my_huge_list does not need to be processed in any order and does not need to be retained after this loop.]

[I have considered turning the for loop into a sub-function and using map, and also using the shallow copy, but I can't figure out this architecture.]

[I'm sure from testing that removing list elements with a list comprehension is quicker than brute-forcing all 4m sublists.]

Thanks!

Upvotes: 1

Views: 1703

Answers (2)

r.ook

Reputation: 13868

Here's my dig on it:

my_huge_list = [[0,1,2,3,4],[0,1,3,4],[0,1,2,3,4],[0,1,3,4],[0,1,2,4],[0,1,2,3,4],['A','B'],[0,1,3,'A','B'],[0,'A','B'],[0,1,2,3,4],[0,1,3,4],[0,1,2,3,4],[0,1,3,4],[0,1,2,4]] #to 4m assorted entries

# ... do whatever and return unwanted list... #

# ... if needed, convert the returned items into lists before putting into unwanted ... #

unwanted = [[1,2], ['A','B']]

index = 0
while index < len(my_huge_list):
    sublist = my_huge_list[index]
    keep = True
    for u in unwanted:
        # does u appear as a run of consecutive elements in sublist (or equal the whole sublist)?
        if u in [sublist[j:j+len(u)] for j in range(len(sublist)-len(u)+1)] or u == sublist:
            my_huge_list.pop(index)
            keep = False
            break   # sublist is gone; don't test the remaining patterns against it
    index += keep   # only advance when nothing was removed at this index

print(my_huge_list)

# [[0, 1, 3, 4], [0, 1, 3, 4], [0, 1, 3, 4], [0, 1, 3, 4]]

It's not elegant but it gets the job done. A huge caveat is that modifying a list while iterating over it is bad karma (pros will probably shake their heads at me), but dealing with a size of 4 mil you can understand I'm trying to save some memory by modifying in place.

This is also scalable: if you have multiple unwanted patterns of different sizes, it should still catch them in your huge list. If your pattern is a single element, match the expected element type from your my_huge_list, e.g. if my_huge_list contains a [1], your unwanted entry should be [1] as well, and if the element is a string instead of a list, you'll need that string in your unwanted. An int/float will however break the current code, since you can't iterate over it, but you could add extra handling before iterating through unwanted.
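For instance (a hedged sketch reusing the same while loop on made-up data, not the asker's real objects), patterns of different lengths and a single-element pattern all work, as long as each pattern is itself a list:

my_huge_list = [[0, 1, 2, 3], [0, 1, 3, 4], ['A', 'B'], [5, 'A', 'B', 6], ['C'], [7, 'C', 8]]
unwanted = [[1, 2], ['A', 'B'], ['C']]   # mixed lengths, all expressed as lists

index = 0
while index < len(my_huge_list):
    sublist = my_huge_list[index]
    keep = True
    for u in unwanted:
        if u in [sublist[j:j+len(u)] for j in range(len(sublist)-len(u)+1)] or u == sublist:
            my_huge_list.pop(index)
            keep = False
            break
    index += keep

print(my_huge_list)   # [[0, 1, 3, 4]]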

Upvotes: 1

Barmar

Reputation: 780879

You're iterating over your huge list repeatedly: once in the main for loop, and then again in a list comprehension each time you find an invalid element, to remove all of those invalid elements.

It would be better to simply filter those elements out of the list once with a list comprehension.

def sublist_contains(l, pair):
    for i in range(len(l)-1):
        if l[i] == pair[0] and l[i+1] == pair[1]:
            return True
    return False

output_list = [sublist for sublist in my_huge_list if not (sublist_contains(sublist, ['A', 'B']) or sublist_contains(sublist, ['1', '2']))]

My sublist_contains() function assumes it's always just two elements in a row you have to test for. You can replace this with a more general function if necessary. See elegant find sub-list in list
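For example, here is a minimal sketch (not from the original answer; the names are made up) of a more general check that looks for any contiguous run of elements, whatever its length:

def sublist_contains_run(l, run):
    # True if `run` appears as consecutive elements of `l`, for runs of any length
    n = len(run)
    return any(l[i:i+n] == run for i in range(len(l) - n + 1))

output_list = [sublist for sublist in my_huge_list
               if not any(sublist_contains_run(sublist, banned)
                          for banned in (['A', 'B'], ['1', '2']))]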

Upvotes: 0
