ling

Reputation: 1675

How to make this search and count much faster?

def count_occurrences(string):
    count = 0
    for text in GENERIC_TEXT_STORE:
        count += text.count(string)
    return count

GENERIC_TEXT_STORE is a list of strings. For example:

GENERIC_TEXT_STORE = ['this is good', 'this is a test', "that's not a test"]

Given a search string such as 'this', I want to find how many times it occurs in GENERIC_TEXT_STORE. When GENERIC_TEXT_STORE is huge, this is very slow. What are the ways to make this search and count much faster? For instance, would splitting the big GENERIC_TEXT_STORE list into multiple smaller lists be faster?

If the multiprocessing module is useful here, how can I apply it for this purpose?

Upvotes: 1

Views: 82

Answers (2)

Ch3steR

Reputation: 20669

You can use re, or join the list into a single string and use str.count on it:

In [2]: GENERIC_TEXT_STORE = ['this is good', 'this is a test', 'that\'s not a test']

In [3]: def count_occurrences(string):
   ...:     count = 0
   ...:     for text in GENERIC_TEXT_STORE:
   ...:         count += text.count(string)
   ...:     return count

In [6]: import re

In [7]: def count(_str):
   ...:     return len(re.findall(_str,''.join(GENERIC_TEXT_STORE)))
   ...:
In [28]: def count1(_str):
    ...:     return ' '.join(GENERIC_TEXT_STORE).count(_str)
    ...:
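Note: joining on '' can glue the end of one string to the start of the next and create spurious matches across the boundary, and re.findall treats its first argument as a regex pattern. If you want literal substring counts, a safer variant would be something like this (count_safe is an illustrative name and is not part of the benchmarks below):

In [8]: def count_safe(_str):
   ...:     # re.escape makes any regex metacharacters in the query literal;
   ...:     # joining on a space keeps adjacent strings from merging into a match
   ...:     return len(re.findall(re.escape(_str), ' '.join(GENERIC_TEXT_STORE)))
   ...: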

Now, using timeit to analyse the execution time:

When the size of GENERIC_TEXT_STORE is 3:

In [9]: timeit count('this')
1.27 µs ± 57.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [10]: timeit count_occurrences('this')
697 ns ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [33]: timeit count1('this')
385 ns ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

When the size of GENERIC_TEXT_STORE is 15000:

In [17]: timeit count('this')
1.07 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [18]: timeit count_occurrences('this')
3.35 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [37]: timeit count1('this')
275 µs ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

When the size of GENERIC_TEXT_STORE is 150000:

In [20]: timeit count('this')
5.7 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [21]: timeit count_occurrences('this')
33 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [40]: timeit count1('this')
3.98 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

When the size of GENERIC_TEXT_STORE is over a million (1500000):

In [23]: timeit count('this')
50.3 ms ± 7.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [24]: timeit count_occurrences('this')
283 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [43]: timeit count1('this')
40.7 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In execution time: count1 < count < count_occurrences.

When the size of GENERIC_TEXT_STORE is large, count and count1 are roughly 4 to 5 times faster than count_occurrences.
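Outside IPython, the standard timeit module gives equivalent measurements (a quick sketch; the number of calls here is arbitrary):

import timeit

# time 1000 calls of count1('this') using the current global namespace
print(timeit.timeit("count1('this')", globals=globals(), number=1000))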

Upvotes: 0

Simon Notley

Reputation: 2136

First, check that your algorithm is actually doing what you want, as suggested in the comments above. The count() method checks for substring matches, so you could probably get a big improvement by refactoring your code to test only complete words, assuming that's what you want. Something like this could work as your condition:

any(word == string for word in text.split())
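For example, a whole-word counter could look like the sketch below (count_word_occurrences is an illustrative name; it uses sum rather than any so that repeated words within one text are all counted):

def count_word_occurrences(string):
    count = 0
    for text in GENERIC_TEXT_STORE:
        # match complete words only, so searching for 'is' won't also count 'this'
        count += sum(word == string for word in text.split())
    return count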

Multiprocessing would probably help, as you could split the list into smaller lists (one per core) and then add up all the results when each process finishes (avoiding inter-process communication during the execution). I've found from testing that multiprocessing in Python varies quite a bit between operating systems: Windows and Mac can take quite a long time to actually spawn the processes, whereas Linux seems to do it much faster. Some people have said that setting a CPU affinity for each process using pstools is important, but I didn't find that it made much difference in my case.
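A minimal sketch of that approach, keeping the substring semantics of your original function (the chunking scheme, worker count, and function names here are illustrative):

import multiprocessing

GENERIC_TEXT_STORE = ['this is good', 'this is a test', "that's not a test"]

def count_chunk(args):
    # each worker counts within its own chunk; the partial results are
    # summed at the end, so there is no communication during the run
    string, texts = args
    return sum(text.count(string) for text in texts)

def count_occurrences_parallel(string, texts, workers=4):
    # split the list into one slice per worker
    chunk_size = (len(texts) + workers - 1) // workers
    chunks = [(string, texts[i:i + chunk_size])
              for i in range(0, len(texts), chunk_size)]
    with multiprocessing.Pool(workers) as pool:
        return sum(pool.map(count_chunk, chunks))

if __name__ == '__main__':  # required where processes are spawned (Windows, Mac)
    print(count_occurrences_parallel('this', GENERIC_TEXT_STORE))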

Another option would be to use Cython to compile your Python into C, or to rewrite the whole thing in a faster language, but as you've tagged this question Python I assume you're not so keen on that.

Upvotes: 1
