Shifu

Reputation: 2185

How do I find the most common words in multiple separate texts?

Bit of a simple question really, but I can't seem to crack it. I have a string that is formatted in the following way:

["category1",("data","data","data")]
["category2", ("data","data","data")]

I call the different categories posts and I want to get the most frequent words from the data section. So I tried:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
   tokens = wordpunct_tokenize(text2)
   for token in tokens:
       if token in freq_dict:
           freq_dict[token] += 1
       else:
           freq_dict[token] = 1
   top = sorted(freq_dict, key=freq_dict.get, reverse=True)
   top = top[:50]
   print top

However, this gives me the top words PER post in the string.

I need an overall top-words list across all posts, but if I move print top out of the for loop, it only gives me the results of the last post.
Does anyone have an idea?

Upvotes: 4

Views: 3904

Answers (4)

Janus Troelsen

Reputation: 21318

from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize

texts = ["a quick brown car", "a fast yellow rose", "a quick night rider", "a yellow officer"]
# tokenize every text, chain all the tokens into one stream, and count them
print Counter(chain.from_iterable(wordpunct_tokenize(x) for x in texts)).most_common(3)

outputs:

[('a', 4), ('yellow', 2), ('quick', 2)]

As you can see in the documentation for Counter.most_common, the returned list is sorted.
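For instance, calling most_common() with no argument returns every entry, ordered from most to least common:

from collections import Counter
# most_common() with no argument returns all entries, highest count first
print Counter("aaabbc").most_common()

outputs:

[('a', 3), ('b', 2), ('c', 1)]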

To use with your code, you can do

texts = (x[1] for x in posts)

or you can do

... wordpunct_tokenize(x[1]) for x in posts ...
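
Put together, that second variant might look like this (a minimal sketch, assuming posts is a list of (category, text) pairs where each text is a single string):

from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize

# hypothetical sample posts: (category, text) pairs
posts = [("category1", "a quick brown car"), ("category2", "a quick yellow rose")]

# tokenize the text part of each post and count all tokens in one pass
print Counter(chain.from_iterable(wordpunct_tokenize(x[1]) for x in posts)).most_common(3)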

If your posts actually look like this:

posts=[("category1",["a quick brown car", "a fast yellow rose"]), ("category2",["a quick night rider", "a yellow officer"])]

You can get rid of the categories:

texts = list(chain.from_iterable(x[1] for x in posts))

(texts will be ['a quick brown car', 'a fast yellow rose', 'a quick night rider', 'a yellow officer'])

You can then use that list in the snippet at the top of this answer.
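
For completeness, the whole pipeline for that nested structure could look like this (a sketch, assuming the posts value shown above):

from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize

posts = [("category1", ["a quick brown car", "a fast yellow rose"]),
         ("category2", ["a quick night rider", "a yellow officer"])]

# flatten the per-category lists into a single stream of texts
texts = chain.from_iterable(x[1] for x in posts)
# tokenize each text and count every token across all categories
print Counter(chain.from_iterable(wordpunct_tokenize(t) for t in texts)).most_common(3)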

Upvotes: 2

likeitlikeit

Reputation: 5638

This is a scope problem: top is computed and printed inside the loop, so you only ever see per-post results. Also, since freq_dict is a defaultdict(int), you don't need to check whether a token is already present, which simplifies your code.

Try it like this:

posts = [["category1",("data1 data2 data3")],["category2", ("data1 data3 data5")]]

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
   tokens = wordpunct_tokenize(text2)
   for token in tokens:
      freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print top

This, as expected, outputs

['data1', 'data3', 'data5', 'data2']

as a result.

If you really have something like

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

as an input, you won't need wordpunct_tokenize() as the input data is already tokenized. Then, the following would work:

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

from collections import defaultdict
freq_dict = defaultdict(int)

for cat, tokens in posts:
   for token in tokens:
      freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print top

and it also outputs the expected result:

['data1', 'data3', 'data5', 'data2']

Upvotes: 3

pradyunsg

Reputation: 19486

Just change your code so that all the posts are processed first, and only then get the top words:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict

freq_dict = defaultdict(int)

for cat, text2 in posts:
   tokens = wordpunct_tokenize(text2)
   for token in tokens:
       freq_dict[token] += 1
# get top after all posts have been processed.
top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print top

Upvotes: 1

Fredrik Pihl

Reputation: 45670

Why not just use Counter?

In [30]: from collections import Counter

In [31]: data=["category1",("data","data","data")]

In [32]: Counter(data[1])
Out[32]: Counter({'data': 3})

In [33]: Counter(data[1]).most_common()
Out[33]: [('data', 3)]
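
To get a combined count over every post rather than a single one, a sketch (assuming posts is the list of (category, tokens) pairs from the question) could keep updating one Counter:

from collections import Counter

# hypothetical posts: (category, tokens) pairs as in the question
posts = [["category1", ("data", "data", "other")],
         ["category2", ("data", "more", "other")]]

total = Counter()
for cat, tokens in posts:
    total.update(tokens)      # add this post's token counts to the running total

print total.most_common(50)   # overall top words across all posts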

Upvotes: 3
