tktktk0711
tktktk0711

Reputation: 1694

How to faster compute the count frequency of words in a large words list with python and be a dictionary

There is a very long words list, the length of list is about 360000. I want to get the each word frequency, and to be a dictionary.

For example:

{'I': 50, 'good': 30,.......}

Since the word list is large, I found it take a lot of time to compute it. Do you have faster method to accomplish this?

My code, so far, is the following:

  dict_pronoun = dict([(i, lst_all_tweet_noun.count(i)) for i in 
                        lst_all_tweet_noun])
  sorted(dict_pronoun)

Upvotes: 0

Views: 1266

Answers (1)

Martijn Pieters
Martijn Pieters

Reputation: 1122122

You are doing several things wrong here:

  • You are building a huge list first, then turn that list object into a dictionary. There is no need to use the [..] list comprehension; just dropping the [ and ] would turn it into a much more memory-efficient generator expression.

  • You are using dict() with a loop instead of a {keyexpr: valueexpr for ... in ...} dictionary comprehension; this would avoid a generator expression altogether and go straight to building a dictionary.

  • You are using list.count(), this does a full scan of the list for every element. You turned a linear scan to count N items into a O(N**2) quadratic problem. You could simply increment an integer in the dictionary each time you find the key already is present, set the value to 0 otherwise, but there are better options (see below).

  • The sorted() call is busy-work; it returns a sorted list of keys that is then discarded again. Dictionaries are not sortable, not and produce a dictionary again at any rate.

Use a collections.Counter() object here to do your counting; it uses a linear scan:

from collections import Counter

dict_pronoun = Counter(lst_all_tweet_noun)

A Counter has a Counter.most_common() method which will efficiently give you output sorted by counts, which is what I suspect you wanted to achieve with the sorted() call.

For example, to get the top K elements (where K is smaller than N, the size of the dictionary), a heapq is used to get you those elements in O(NlogK) time (avoiding a full O(NlogN) sort).

Upvotes: 11

Related Questions