Reputation: 305
Using Python 3, I have a list containing over 100,000 strings (list1), each 300 characters long at most. I also have a list of over 9 million substrings (list2)--I want to count how many elements a substring in list2 appears in. For instance,
list1 = ['cat', 'caa', 'doa', 'oat']
list2 = ['at', 'ca', 'do']
I want the function to return (mapped to list2):
[2, 2, 1]
Normally, this is very simple and requires very little. However, due to the massive size of the lists, I have efficiency issues. I want to find the fastest way to return that counter list.
I've tried list comprehensions, generators, maps, loops of all kinds and I have yet to find an fast way to do this easy task. What is theoretically the fastest way to complete this objective, preferably taking O(len(list2))
steps very quickly?
Upvotes: 7
Views: 2378
Reputation: 3617
I believe this task can be solved in linear time with an Aho Corasick string matching machine. See this answer for additional information (maybe you get ideas from the other answers to that question, too - it is almost the same task and I think Aho Corasick is the theoretically fastest way to solve this).
You will have to modify the string matching machine in such way, that instead of returning the match it increases the counter of every matched substring by one. (This should be only a minor modification).
Upvotes: 2
Reputation: 49013
Not sure how you could avoid having some sort of O(n**2) algorithm. Here is a simple implementation.
>>> def some_sort_of_count(list1, list2):
>>> return [sum(x in y for y in list1) for x in list2]
>>>
>>> list1 = ['cat', 'caa', 'doa', 'oat']
>>> list2 = ['at', 'ca', 'do']
>>> some_sort_of_count(list1, list2)
[2, 2, 1]
Upvotes: 0
Reputation: 16782
set M = len(list1)
and N = len(list2)
For each of the N entries in list2
you're going to have to do M comparisons against the entries in list1
. That's a worst case running time of O(M x N)
. If you take it further, lets take each entry in list2
to be of length 1 and each entry in list1
to be of length 300, then you got a running time of O(300M x N)
.
If performance is really an issue, try dynamic programming. Here is a start:
1) sort list2
in ascending order of length like so:
['scorch', 'scorching', 'dump', 'dumpster', 'dumpsters']
2) sort them into sublists such that each preceeding entry is a subset of the proceeding entry like so:
[['scorch', 'scorching'] , ['dump', 'dumpster', 'dumpsters']]
3) Now if you compare against list1
and 'scorch'
is not in there, then you don't have to search for 'scorching'
either. Likewise if 'dump'
is not in there, neither is 'dumpster'
or 'dumpsters'
note the worst case run time is still the same
Upvotes: 2