Reputation: 1675
def count_occurrences(string):
    count = 0
    for text in GENERIC_TEXT_STORE:
        count += text.count(string)
    return count
GENERIC_TEXT_STORE is a list of strings. For example:
GENERIC_TEXT_STORE = ['this is good', 'this is a test', "that's not a test"]
Given a string 'text', I want to find how many times the text, e.g. 'this', occurs in GENERIC_TEXT_STORE. If my GENERIC_TEXT_STORE is huge, this is very slow. What are some ways to make this search and count much faster? For instance, if I split the big GENERIC_TEXT_STORE list into multiple smaller lists, would that be faster?
If the multiprocessing module is useful here, how would I apply it for this purpose?
Upvotes: 1
Views: 82
Reputation: 20669
You can use re.
In [2]: GENERIC_TEXT_STORE = ['this is good', 'this is a test', 'that\'s not a test']

In [3]: def count_occurrences(string):
   ...:     count = 0
   ...:     for text in GENERIC_TEXT_STORE:
   ...:         count += text.count(string)
   ...:     return count

In [6]: import re

In [7]: def count(_str):
   ...:     # join with a space so words at the boundary of two list items
   ...:     # can't merge into spurious matches (''.join could create those)
   ...:     return len(re.findall(_str, ' '.join(GENERIC_TEXT_STORE)))
   ...:

In [28]: def count1(_str):
    ...:     return ' '.join(GENERIC_TEXT_STORE).count(_str)
    ...:
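(The answer doesn't show how the larger lists below were built; one minimal way, assuming simple replication of the three-element list, is:)

GENERIC_TEXT_STORE = GENERIC_TEXT_STORE * 5000   # hypothetical setup: 3 -> 15000 entries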
Now use timeit to analyse the execution time. When the size of GENERIC_TEXT_STORE is 3:
In [9]: timeit count('this')
1.27 µs ± 57.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [10]: timeit count_occurrences('this')
697 ns ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [33]: timeit count1('this')
385 ns ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
When the size of GENERIC_TEXT_STORE is 15000:
In [17]: timeit count('this')
1.07 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [18]: timeit count_occurrences('this')
3.35 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [37]: timeit count1('this')
275 µs ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
When the size of GENERIC_TEXT_STORE is 150000:
In [20]: timeit count('this')
5.7 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [21]: timeit count_occurrences('this')
33 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [40]: timeit count1('this')
3.98 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
When the size of GENERIC_TEXT_STORE is over a million (1500000):
In [23]: timeit count('this')
50.3 ms ± 7.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [24]: timeit count_occurrences('this')
283 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [43]: timeit count1('this')
40.7 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
count1 < count < count_occurrences
When the size of GENERIC_TEXT_STORE is large, count and count1 are almost 4 to 5 times faster than count_occurrences.
Upvotes: 0
Reputation: 2136
First, check that your algorithm is actually doing what you want, as suggested in the comments above. The count() method checks for substring matches, so you could probably get a big improvement by refactoring your code to test only complete words, assuming that's what you want. Something like this could work as your condition:
any(word == string for word in text.split())
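For instance, here is a minimal sketch of a whole-word counter built on that idea (count_whole_words is a hypothetical name, not from the question; sum replaces any so every matching word is counted, not just the first per text):

def count_whole_words(string):
    # count whole-word matches rather than substring matches
    count = 0
    for text in GENERIC_TEXT_STORE:
        count += sum(1 for word in text.split() if word == string)
    return count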
Multiprocessing would probably help, as you could split the list into smaller lists (one per core) and then add up all the results when each process finishes (avoiding inter-process communication during the execution). I've found from testing that multiprocessing in Python varies quite a bit between operating systems: Windows and Mac can take quite a long time to actually spawn the processes, whereas Linux seems to do it much faster. Some people have said that setting a CPU affinity for each process using pstools is important, but I didn't find that it made much difference in my case.
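A minimal sketch of that approach (count_in_chunk and count_occurrences_parallel are illustrative names, and GENERIC_TEXT_STORE is assumed to be defined at module level as in the question):

import multiprocessing

GENERIC_TEXT_STORE = ['this is good', 'this is a test', "that's not a test"]

def count_in_chunk(chunk, string):
    # each worker counts occurrences within its own slice of the list
    return sum(text.count(string) for text in chunk)

def count_occurrences_parallel(string, n_procs=4):
    # split the list into roughly one chunk per process, then sum the
    # partial counts; the only inter-process traffic is chunks and results
    step = max(1, len(GENERIC_TEXT_STORE) // n_procs)
    chunks = [GENERIC_TEXT_STORE[i:i + step]
              for i in range(0, len(GENERIC_TEXT_STORE), step)]
    with multiprocessing.Pool(n_procs) as pool:
        return sum(pool.starmap(count_in_chunk, [(c, string) for c in chunks]))

if __name__ == '__main__':   # required on Windows/Mac, where workers are spawned
    print(count_occurrences_parallel('this'))

Note that for small lists the spawning overhead will dominate, so this only pays off once GENERIC_TEXT_STORE is large.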
Another option would be to use Cython to compile your Python into a C program, or alternatively to rewrite the whole thing in a faster language, but as you've tagged this question Python I assume you're not so keen on that.
Upvotes: 1