Wasim Aftab
Wasim Aftab

Reputation: 808

Most efficient way to find indices of common tuples in two very large lists of tuples in python 3?

I have two large lists of tuples: "neg" (length ~40K) and "All" (length ~2M) whose proxies can be downloaded from the following link

ftp://ftp.lrz.de/transfer/List_Intersect/

I would like to search "neg" inside "All" and return the matched indices in "All". I tried the following solution which takes 788.1487 seconds on a fairly powerful pc (see specs below). Moreover, it doesn't retain the correct order.

https://stackoverflow.com/a/39500933/6524326

In fact, the following code does the required job in 202.6451 seconds. Can It be made even faster?

def findTupleIndices(smallList, bigList):
 comList = sorted(set(smallList) & set(bigList), key=smallList.index)
 idx = [bigList.index(x) for x in comList]
 return(idx)

pc specs

Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz, 32GB RAM DDR4-2133 MHz

Upvotes: 1

Views: 25

Answers (1)

RomanPerekhrest
RomanPerekhrest

Reputation: 92884

Smashingly with temporary hash-table (dict) where tuples of "big list" are keys and their indices - values.

Initial approach stats:

from timeit import timeit

def findTupleIndices(sub_lst, search_lst):
    comList = sorted(set(sub_lst) & set(search_lst), key=sub_lst.index)
    idx = [search_lst.index(x) for x in comList]
    return idx   

# sub_lst, search_lst are lists of tuples extracted from `ftp://ftp.lrz.de/transfer/List_Intersect/`

print(timeit('findTupleIndices(sub_lst, search_lst)', 'from __main__ import findTupleIndices, sub_lst, search_lst', number=1000))

The output:

191.43023270001868

New approach stats:

from timeit import timeit

def find_tuple_indices(sub_lst, search_lst):
    pos_dict = dict((t,i) for i, t in enumerate(search_lst))
    return [pos_dict[t] for i, t in enumerate(sub_lst) if t in pos_dict]

# sub_lst, search_lst are lists of tuples extracted from `ftp://ftp.lrz.de/transfer/List_Intersect/`

print(timeit('find_tuple_indices(sub_lst, search_lst)', 'from __main__ import find_tuple_indices, sub_lst, search_lst', number=1000))

The output:

1.4070011030125897

Upvotes: 1

Related Questions