user1104160
user1104160

Reputation: 305

Fast String within List Searching

Using Python 3, I have a list containing over 100,000 strings (list1), each 300 characters long at most. I also have a list of over 9 million substrings (list2)--I want to count how many elements a substring in list2 appears in. For instance,

list1 = ['cat', 'caa', 'doa', 'oat']
list2 = ['at', 'ca', 'do']

I want the function to return (mapped to list2):

[2, 2, 1]

Normally, this is very simple and requires very little. However, due to the massive size of the lists, I have efficiency issues. I want to find the fastest way to return that counter list.

I've tried list comprehensions, generators, maps, loops of all kinds and I have yet to find an fast way to do this easy task. What is theoretically the fastest way to complete this objective, preferably taking O(len(list2)) steps very quickly?

Upvotes: 7

Views: 2378

Answers (3)

tobigue
tobigue

Reputation: 3617

I believe this task can be solved in linear time with an Aho Corasick string matching machine. See this answer for additional information (maybe you get ideas from the other answers to that question, too - it is almost the same task and I think Aho Corasick is the theoretically fastest way to solve this).

You will have to modify the string matching machine in such way, that instead of returning the match it increases the counter of every matched substring by one. (This should be only a minor modification).

Upvotes: 2

hughdbrown
hughdbrown

Reputation: 49013

Not sure how you could avoid having some sort of O(n**2) algorithm. Here is a simple implementation.

>>> def some_sort_of_count(list1, list2):
>>>     return [sum(x in y for y in list1) for x in list2]
>>> 
>>> list1 = ['cat', 'caa', 'doa', 'oat']
>>> list2 = ['at', 'ca', 'do']
>>> some_sort_of_count(list1, list2)
[2, 2, 1]

Upvotes: 0

puk
puk

Reputation: 16782

set M = len(list1) and N = len(list2)

For each of the N entries in list2 you're going to have to do M comparisons against the entries in list1. That's a worst case running time of O(M x N). If you take it further, lets take each entry in list2 to be of length 1 and each entry in list1 to be of length 300, then you got a running time of O(300M x N).

If performance is really an issue, try dynamic programming. Here is a start:

1) sort list2 in ascending order of length like so:

['scorch', 'scorching', 'dump', 'dumpster', 'dumpsters']

2) sort them into sublists such that each preceeding entry is a subset of the proceeding entry like so:

[['scorch', 'scorching'] , ['dump', 'dumpster', 'dumpsters']]

3) Now if you compare against list1 and 'scorch' is not in there, then you don't have to search for 'scorching' either. Likewise if 'dump' is not in there, neither is 'dumpster' or 'dumpsters'

note the worst case run time is still the same

Upvotes: 2

Related Questions