Reputation: 1675
def count_occurrences(string):
    count = 0
    for text in GENERIC_TEXT_STORE:
        count += text.count(string)
    return count
GENERIC_TEXT_STORE is a list of strings. For example:
GENERIC_TEXT_STORE = ['this is good', 'this is a test', "that's not a test"]
Given a string 'text', I want to find how many times the text, e.g. 'this', occurs in GENERIC_TEXT_STORE. If my GENERIC_TEXT_STORE is huge, this is very slow. What are some ways to make this search and count much faster? For instance, if I split the big GENERIC_TEXT_STORE list into multiple smaller lists, would that be faster?
If the multiprocessing module is useful here, how would I apply it for this purpose?
Upvotes: 1
Views: 82
Reputation: 20669
You can use re.
In [2]: GENERIC_TEXT_STORE = ['this is good', 'this is a test', 'that\'s not a test']

In [3]: def count_occurrences(string):
   ...:     count = 0
   ...:     for text in GENERIC_TEXT_STORE:
   ...:         count += text.count(string)
   ...:     return count

In [6]: import re

In [7]: def count(_str):
   ...:     # join with a space so words at the boundary of two list items
   ...:     # can't merge into spurious matches (''.join could create those)
   ...:     return len(re.findall(_str, ' '.join(GENERIC_TEXT_STORE)))
   ...:

In [28]: def count1(_str):
    ...:     return ' '.join(GENERIC_TEXT_STORE).count(_str)
    ...:
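(The answer doesn't show how the larger lists below were built; one minimal way, assuming simple replication of the three-element list, is:)

GENERIC_TEXT_STORE = GENERIC_TEXT_STORE * 5000   # hypothetical setup: 3 -> 15000 entries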
Now use timeit to analyse the execution time. When the size of GENERIC_TEXT_STORE is 3:
In [9]: timeit count('this')
1.27 µs ± 57.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [10]: timeit count_occurrences('this')
697 ns ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [33]: timeit count1('this')
385 ns ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
When the size of GENERIC_TEXT_STORE is 15000:
In [17]: timeit count('this')
1.07 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [18]: timeit count_occurrences('this')
3.35 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [37]: timeit count1('this')
275 µs ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
When the size of GENERIC_TEXT_STORE is 150000:
In [20]: timeit count('this')
5.7 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [21]: timeit count_occurrences('this')
33 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [40]: timeit count1('this')
3.98 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
When the size of GENERIC_TEXT_STORE is over a million (1500000):
In [23]: timeit count('this')
50.3 ms ± 7.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [24]: timeit count_occurrences('this')
283 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [43]: timeit count1('this')
40.7 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
count1 < count < count_occurrences
When the size of GENERIC_TEXT_STORE is large, count and count1 are almost 4 to 5 times faster than count_occurrences.
Upvotes: 0
Reputation: 2136
First, check that your algorithm is actually doing what you want, as suggested in the comments above. The count() method checks for substring matches, so you could probably get a big improvement by refactoring your code to test only complete words, assuming that's what you want. Something like this could work as your condition:
any(word == string for word in text.split())
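For instance, here is a minimal sketch of a whole-word counter built on that idea (count_whole_words is a hypothetical name, not from the question; sum replaces any so every matching word is counted, not just the first per text):

def count_whole_words(string):
    # count whole-word matches rather than substring matches
    count = 0
    for text in GENERIC_TEXT_STORE:
        count += sum(1 for word in text.split() if word == string)
    return count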
Multiprocessing would probably help, as you could split the list into smaller lists (one per core) and then add up all the results when each process finishes (avoiding inter-process communication during the execution). I've found from testing that multiprocessing in Python varies quite a bit between operating systems: Windows and Mac can take quite a long time to actually spawn the processes, whereas Linux seems to do it much faster. Some people have said that setting a CPU affinity for each process using pstools is important, but I didn't find that it made much difference in my case.
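A minimal sketch of that approach (count_in_chunk and count_occurrences_parallel are illustrative names, and GENERIC_TEXT_STORE is assumed to be defined at module level as in the question):

import multiprocessing

GENERIC_TEXT_STORE = ['this is good', 'this is a test', "that's not a test"]

def count_in_chunk(chunk, string):
    # each worker counts occurrences within its own slice of the list
    return sum(text.count(string) for text in chunk)

def count_occurrences_parallel(string, n_procs=4):
    # split the list into roughly one chunk per process, then sum the
    # partial counts; the only inter-process traffic is chunks and results
    step = max(1, len(GENERIC_TEXT_STORE) // n_procs)
    chunks = [GENERIC_TEXT_STORE[i:i + step]
              for i in range(0, len(GENERIC_TEXT_STORE), step)]
    with multiprocessing.Pool(n_procs) as pool:
        return sum(pool.starmap(count_in_chunk, [(c, string) for c in chunks]))

if __name__ == '__main__':   # required on Windows/Mac, where workers are spawned
    print(count_occurrences_parallel('this'))

Note that for small lists the spawning overhead will dominate, so this only pays off once GENERIC_TEXT_STORE is large.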
Another option would be to use Cython to compile your Python into a C program, or alternatively to rewrite the whole thing in a faster language, but as you've tagged this question Python I assume you're not so keen on that.
Upvotes: 1