James
James

Reputation: 424

fast search for integer list in much larger set of longer integer lists

I'm looking for an efficient python implementation of the following:

I have a large set of integer lists between 4 and >100 integers in length but mostly around a length of 4-10. There could be up to a million in total depending on the dataset. They are order specific. The integers themselves will range from 0 to <=99999.

I will have input search lists of between 3 and 5 integers in length, again order specific. I need to find all examples from the larger set of integer lists, where the list contains an input search list.

e.g.: example large set of integer lists [1,40, 98, 32, 778], [7, 9347, 21, 98345, 632, 444], [87567, 4563, 97, 40, 87], [1, 40, 98, 32, 778], [4563, 97, 40, 87, 76], [935, 57342, 86, 213, 89674, 4327, 9641, 13283], [4563, 40, 87, 76, 97]

Example query [4563, 97, 40].

Result [87567, 4563, 97, 40, 87], [4563, 97, 40, 87, 76] but NOT [4563, 40, 87, 76, 97].

I can store the set of integer lists in a dict and search the keys for the query integer list but this is slow. I can write the integer lists to flat file and use grep to search them which is fast but a nasty hack. Ultimately I have further code I need to run on the results (matched lists) so I'd prefer to stay in my current python workflow.

I am aware of search algorithms like aho corasick but I'm working with integers not text and I am doing the reverse (searching whole strings for a substring).

Upvotes: 0

Views: 413

Answers (2)

user1196549
user1196549

Reputation:

If you can afford the storage and the preprocessing time, you can insert all triples, quadruples and quintuples found from the lists in three distinct dictionaries. The dictionary entries will store the sets of lists where these tuples occur, and where in the lists.

Then a query will be performed in time just proportional to the number of matches.

Upvotes: 0

Samfaitmal
Samfaitmal

Reputation: 80

first of all, I will suggest you have a look at https://wiki.python.org/moin/PythonSpeed/PerformanceTips .

Depending, for exemple, on how you write your loops, the computation time may vary a lot.

The following code works... Quid of performance ???

#Your List of lists
L = [[1,40, 98, 32, 778], [7, 9347, 21, 98345, 632, 444], [87567, 4563, 97, 40, 87], [1, 40, 98, 32, 778], [4563, 97, 40, 87, 76], [935, 57342, 86, 213, 89674, 4327, 9641, 13283], [4563, 40, 87, 76, 97]]

#Your list of search items
query= [4563, 97, 40]


def queryInList(Q,l):
    lidx = []
    for q in Q:
        try:
            lidx.append(l.index(q))
            if lidx[len(lidx)-1] < lidx[len(lidx)-2]:
                return False
        except ValueError:
            return False
    return True



l = [l for l in L if queryInList(query, l)]
print l

Upvotes: 1

Related Questions