MSeifert
MSeifert

Reputation: 152637

Fastest way to chain multiple indexing operations for lists?

I want to classify some data I have and for that to work I want to chain indexing of python lists. Simplified I have a nested list:

lst = [[[1], [2]], [[3, 3], [4]], [[5], [6,6,6]]]

and I want to iterate over the product of the first two indices but keep the third one the same:

from itertools import product

for index1, index2 in product(range(3), range(2)):
    print(lst[index1][index2][0])

However I want to make this more general without knowing in advance how many substructures deep this needs to go (I want to make the number of ranges passed to itertools.product variable).

I'm struggling a bit how I can generalize the [index1][index2][0] to accept an arbitary number of indices, the best I could come up with was functools.reduce:

from functools import reduce

for indices in product(range(3), range(2)):
    print(reduce(list.__getitem__, indices, lst)[0])

That seems awfully complicated (and is significantly slower than the manual indexing), so I was wondering if there was a better and faster way to do that. I'm using both python 2.x and 3.x and external libraries are definetly okay (however it shouldn't require NumPy or packages based on NumPy).

Upvotes: 4

Views: 683

Answers (3)

B. M.
B. M.

Reputation: 18628

I propose a recursive way.

def theshape(lst):
    l=lst
    shape=[]
    while isinstance(l,list):
                shape.append(len(l))
                l=l[0]
    return shape

This function is intended to find the shape of your structure, which is supposed regular until last dimension.

def browse(lst):
    shape=theshape(lst)
    ndim=len(shape)
    def level(l,k):
        if k==ndim:
            print(l)
        else:
            for i in range(shape[k]):
                level(l[i],k+1)
    level(lst,0)

This one browse recursively all the levels. it minimize pointer changes.

A simple example :

u=arange(2**6).reshape(4,2,1,2,2,1,1,2).tolist()
browse(u)
0
2
.
.
.
62

Some tests on big structures (with print discarded by print = lambda _ : None) :

def go(lst):
 for x in product(*[range(k) for k in theshape(lst)]):
    print(reduce(lambda result, index: result[index], x, lst))

In [1]: u=arange(2**21).reshape([2]*21).tolist()

In [2]: %time go(u)
Wall time: 14.8 s

In [3]: %time browse(u)
Wall time: 3.5 s

In [5]: u=arange(2**21).reshape([1]*30+[2**21]+[1]).tolist()

In [6]: %time go(u)
Wall time: 18 s

In [7]: %time browse(u)
Wall time: 3.48 s

In [8]: u=arange(2**21).reshape([1]+[2**21]+[1]*30).tolist()

In [9]: %time go(u)
Wall time: 14 s

In [10]: %time browse(u)
Wall time: 58.1 s

This shows that performance is very data-structure dependent.

EDIT:

Finally, simplest is fastest. theshape is not necessary.

def browse2(lst):
        if isinstance(lst,list):
            for l in lst:
                browse2(l)
        else: print(lst)

it is often 30% faster than browse. And it works whatever the structure of the list.

Upvotes: 2

2ps
2ps

Reputation: 15926

I would just use the python-builtin reduce for this, and it doesn’t seem that complex and it wasn’t that much slower in my testing:

from itertools import product

for x in product(range(3), range(2)):
    rg = reduce(lambda result, index: result[index], x, lst)
    value = rg[0]

If you're worried about the timing penalty from reduce, you can just use a for loop instead:

for x in product(range(3), range(2)):
    value = lst
    for index in x:
        value = value[index]
    value = value[0]

This will, in all cases, be slower than manual indexing because a for loop will require extra operations for figuring out the stop condition. The question, as always, is whether the speed optimization is worth it to you for the flexibility of arbitrary depth specification.

As for why you would use reduce vs. for, there is an ongoing stylistic debate within the JavaScript community as to whether you should use the reduce, map, filter functions on Arrays or do the for loop version instead because it is faster, and you may want to refer to that debate to pick which side you fall on.


Timing with for loop:

In [22]: stmt = '''
    ...: from itertools import product
    ...: def go():
    ...:   lst = [[[1], [2]], [[3, 3], [4]], [[5], [6,6,6]]]
    ...:   for x in product(range(3), range(2)):
    ...:     # rg = reduce(lambda result, index: result[index], x, lst)
    ...:     value = lst
    ...:     for index in x:
    ...:         value = value[index]
    ...:     value = value[0]
    ...:     # value = lst[x[0]][x[1]][0]
    ...: '''

In [23]: timeit(setup=stmt, stmt='go()', number=1000000)
Out[23]: 4.003296852111816

Timing with reduce:

In [18]: stmt = '''
    ...: from itertools import product
    ...: def go():
    ...:   lst = [[[1], [2]], [[3, 3], [4]], [[5], [6,6,6]]]
    ...:   for x in product(range(3), range(2)):
    ...:     rg = reduce(lambda result, index: result[index], x, lst)
    ...:     value = rg[0]
    ...:     # value = lst[x[0]][x[1]][0]
    ...: '''

In [19]: timeit(setup=stmt, stmt='go()', number=1000000)
Out[19]: 6.164631128311157

Timing with manual indexing:

In [16]: stmt = '''
    ...: from itertools import product
    ...: def go():
    ...:   lst = [[[1], [2]], [[3, 3], [4]], [[5], [6,6,6]]]
    ...:   for x in product(range(3), range(2)):
    ...:     # rg = reduce(lambda result, index: result[index], x, lst)
    ...:     value = lst[x[0]][x[1]][0]
    ...: '''

In [17]: timeit(setup=stmt, stmt='go()', number=1000000)
Out[17]: 3.633723020553589

Upvotes: 1

f5r5e5d
f5r5e5d

Reputation: 3706

how about dynamically creating the hard indexing?

lst = [[[1], [2]], [[3, 3], [4]], [[5], [6,6,6]]]

from itertools import product

for index1, index2 in product(range(3), range(2)):
    print(lst[index1][index2][0])


# need depth info from somewhere to create hard coded indexing

prod_gen = product(range(3), range(2))

first = next(prod_gen)

indx_depth = len(first) + 1

exec( ('def IndexThisList(lst, indxl):\n' +
       '        return lst' + ''.join(('[indxl[' + str(i) + ']]' 
                                           for i in range(indx_depth)))))

# just to see what it exec'd:
print(("def IndexThisList(lst, indx_itrbl):\n" +
       "        return lst" + ''.join(('[indx_itrbl[' + str(i) + ']]' 
                                       for i in range(indx_depth)))))
# the exec is only invoked again when changing the indexing depth
# for accessing the list with its currently instantiated depth of indexing
# just use the current instance of the generated function

print(IndexThisList(lst, first + (0,)))
for itpl in prod_gen: 
    print (IndexThisList(lst, itpl + (0,)))

1
2
3
4
5
6
def IndexThisList(lst, indx_itrbl):
        return lst[indx_itrbl[0]][indx_itrbl[1]][indx_itrbl[2]]
1
2
3
4
5
6

just a beginner at this, seems like my exec should be wrapped with another function to pass the index_depth but its eluding me for now

Upvotes: 1

Related Questions