Reputation: 1694
I have a very long word list, about 360,000 words long. I want to get the frequency of each word, as a dictionary.
For example:
{'I': 50, 'good': 30, ...}
Since the word list is large, computing this takes a lot of time. Is there a faster method to accomplish this?
My code, so far, is the following:
dict_pronoun = dict([(i, lst_all_tweet_noun.count(i)) for i in lst_all_tweet_noun])
sorted(dict_pronoun)
Upvotes: 0
Views: 1266
Reputation: 1122122
You are doing several things wrong here:
- You are building a huge list first, then turning that list object into a dictionary. There is no need to use the [..] list comprehension; just dropping the [ and ] would turn it into a much more memory-efficient generator expression.
- You are using dict() with a loop instead of a {keyexpr: valueexpr for ... in ...} dictionary comprehension; this would avoid a generator expression altogether and go straight to building a dictionary.
- You are using list.count(); this does a full scan of the list for every element. You turned a linear scan to count N items into an O(N**2) quadratic problem. You could simply increment an integer in the dictionary each time you find the key is already present, setting the value to 1 otherwise (see the sketch after this list), but there are better options (see below).
- The sorted() call is busy-work; it returns a sorted list of keys that is then discarded again. Dictionaries are not sortable, not in a way that produces a dictionary again at any rate.
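As a sketch of that manual increment-or-initialize idea (word_counts is just an illustrative name; lst_all_tweet_noun is the list from the question), a single linear pass with dict.get() replaces all the repeated full scans:

word_counts = {}
for word in lst_all_tweet_noun:
    # dict.get() supplies 0 for words not seen yet, so every word
    # costs one lookup and one store instead of a full list scan.
    word_counts[word] = word_counts.get(word, 0) + 1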
Use a collections.Counter() object here to do your counting; it uses a linear scan:
from collections import Counter
dict_pronoun = Counter(lst_all_tweet_noun)
A Counter has a Counter.most_common() method which will efficiently give you output sorted by counts, which is what I suspect you wanted to achieve with the sorted() call.
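For example, with a small made-up word list:

from collections import Counter

dict_pronoun = Counter(['I', 'good', 'I', 'good', 'I'])
dict_pronoun.most_common()     # [('I', 3), ('good', 2)]
dict_pronoun.most_common(1)    # [('I', 3)], just the top entry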
For example, to get the top K elements (where K is smaller than N, the size of the dictionary), a heapq is used to get you those elements in O(N log K) time (avoiding a full O(N log N) sort).
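That heap-based selection is what Counter.most_common(K) does internally when K is given; doing the same by hand with heapq.nlargest() on the counter above would look roughly like this:

import heapq

# Keep a heap of size K while scanning all N items: O(N log K)
# instead of the O(N log N) cost of sorting every item.
top_k = heapq.nlargest(3, dict_pronoun.items(), key=lambda kv: kv[1])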
Upvotes: 11