Finn LeSueur

Reputation: 529

Increasing speed/performance in counting through very large lists in Python

I'm writing a program in Python 3, and part of its functionality is to find which word occurs the most in a list and to return the number of occurrences of that word. I have code that works, but part of the requirement is that it take a list of 200,000+ words and complete this task in under a few seconds, and my code takes a really long time to run. I was wondering what suggestions you may have for speed improvements in this method.

def max_word_frequency(words):
    """A method that takes a list and finds the word with the most
    occurrences and returns the number of occurences of that word
    as an integer.
    """
    max_count = 0
    for word in set(words):
        count = words.count(word)
        if count > max_count:
            max_count = count

    return max_count

I have contemplated using a dictionary, since lookups are hashed and super speedy compared to lists, but I cannot quite figure out how to implement this yet.

Thank you for your time everyone!
- Finn

Upvotes: 2

Views: 864

Answers (1)

Maxime Lorant

Reputation: 36181

First, your algorithm loops m times over the whole list of 200,000 words, where m is the number of distinct words in the list. That is really not a good way just to count occurrences of words and select the maximum. I could show you a more efficient algorithm (one that iterates only once over the list), but Python already has tools to do what you want.
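For illustration, that single-pass version can be sketched with a plain dictionary (the function name here is just illustrative); it visits each word exactly once instead of calling count for every distinct word:

```python
def max_word_single_pass(words):
    """Return the highest occurrence count in one pass over the list."""
    counts = {}
    max_count = 0
    for word in words:
        # Increment this word's tally (0 if unseen so far).
        counts[word] = counts.get(word, 0) + 1
        # Keep the running maximum up to date.
        if counts[word] > max_count:
            max_count = counts[word]
    return max_count
```

This runs in O(n) instead of O(n * m), but as shown next, the standard library already packages this pattern for you.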

To solve your problem in a few lines of code, you can use tools available in the standard library, which are implemented in C and are likely more efficient than your loop. The Counter class with its most_common method might help you:

>>> from collections import Counter
>>> counts = Counter(['abc', 'def', 'abc', 'foo', 'bar', 'foo', 'foo'])
>>> counts
Counter({'foo': 3, 'abc': 2, 'bar': 1, 'def': 1})
>>> Counter(['abc', 'def', 'abc', 'foo', 'bar', 'foo', 'foo']).most_common(1)
[('foo', 3)]

Then you just have to return the second element of the tuple (there is only one tuple here, as requested by the 1 argument to most_common).

Performance comparison

Just to compare, I took a sample LaTeX file (~12 KB), split it on spaces (giving a list x of 1835 words), and ran your function and the one below with timeit. You can see a real gain.

>>> len(x)
1835
>>> def max_word_2(words):
...     counts = Counter(words)
...     return counts.most_common(1)[0][1]
>>> timeit.timeit("max_word_2(x)", setup="from __main__ import x, max_word_2", number=1000)
1.1040630340576172
>>> timeit.timeit("max_word_frequency(x)", setup="from __main__ import x, max_word_frequency", number=1000)
35.623037815093994

Just this change might be sufficient to speed up your process :)

Upvotes: 5
