Reputation: 2185
Bit of a simple question really, but I can't seem to crack it. I have a string that is formatted in the following way:
    ["category1", ("data","data","data")]
    ["category2", ("data","data","data")]
I call the different categories posts and I want to get the most frequent words from the data section. So I tried:
from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict

freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        if token in freq_dict:
            freq_dict[token] += 1
        else:
            freq_dict[token] = 1
    top = sorted(freq_dict, key=freq_dict.get, reverse=True)
    top = top[:50]
    print top
However, this gives me the top words PER post in the string. I need an overall top-words list.
However, if I move print top out of the for loop, it only shows the results of the last post.
Does anyone have an idea?
Upvotes: 4
Views: 3904
Reputation: 21318
from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize

texts = ["a quick brown car", "a fast yellow rose", "a quick night rider", "a yellow officer"]
print Counter(chain.from_iterable(wordpunct_tokenize(x) for x in texts)).most_common(3)
outputs:
[('a', 4), ('yellow', 2), ('quick', 2)]
As you can see in the documentation for Counter.most_common, the returned list is sorted.
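As a quick illustration of that sorted order (a minimal sketch using a made-up string, unrelated to the question's data):

```python
from collections import Counter

# most_common() with no argument returns every item,
# ordered from most to least frequent
c = Counter("mississippi")
print(c.most_common())
print(c.most_common(2))  # only the two most frequent letters
```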
To use this with your code, you can do

    texts = (x[1] for x in posts)

or you can do

    ... wordpunct_tokenize(x[1]) for x in posts ...
If your posts actually look like this:
posts=[("category1",["a quick brown car", "a fast yellow rose"]), ("category2",["a quick night rider", "a yellow officer"])]
You can get rid of the categories:
texts = list(chain.from_iterable(x[1] for x in posts))
(texts will be ['a quick brown car', 'a fast yellow rose', 'a quick night rider', 'a yellow officer'])
You can then use that in the snippet at the top of this answer.
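Putting the two pieces together, a runnable sketch of this approach (str.split() stands in for wordpunct_tokenize here only so the example runs without NLTK installed):

```python
from itertools import chain
from collections import Counter

# hypothetical posts in the (category, list-of-texts) shape shown above
posts = [("category1", ["a quick brown car", "a fast yellow rose"]),
         ("category2", ["a quick night rider", "a yellow officer"])]

# flatten away the categories, then tokenize and count in one pass
texts = list(chain.from_iterable(x[1] for x in posts))
top = Counter(chain.from_iterable(t.split() for t in texts)).most_common(3)
print(top)
```

The word "a" appears four times across all posts, so it comes out on top regardless of which post it occurred in.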
Upvotes: 2
Reputation: 5638
This is a scope problem. Also, you don't need to initialize the elements of a defaultdict, which simplifies your code. Try it like this:
posts = [["category1", ("data1 data2 data3")], ["category2", ("data1 data3 data5")]]

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict

freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print top
This, as expected, outputs
['data1', 'data3', 'data5', 'data2']
as a result.
If you really have something like
posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]
as an input, you won't need wordpunct_tokenize(), since the input data is already tokenized. Then, the following would work:
posts = [["category1", ("data1", "data2", "data3")], ["category2", ("data1", "data3", "data5")]]

from collections import defaultdict

freq_dict = defaultdict(int)

for cat, tokens in posts:
    for token in tokens:
        freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print top
and it also outputs the expected result:
['data1', 'data3', 'data5', 'data2']
Upvotes: 3
Reputation: 19486
Just change your code so that all the posts are processed first, and only then get the top words:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict

freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        freq_dict[token] += 1

# get top after all posts have been processed
top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print top
Upvotes: 1
Reputation: 45670
Why not just use Counter?
In [30]: from collections import Counter
In [31]: data=["category1",("data","data","data")]
In [32]: Counter(data[1])
Out[32]: Counter({'data': 3})
In [33]: Counter(data[1]).most_common()
Out[33]: [('data', 3)]
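If you want one combined count across all posts rather than a count per post, Counter objects support addition, so the per-post counts can be summed (a minimal sketch with made-up data in the asker's [category, tuple-of-tokens] shape):

```python
from collections import Counter

# hypothetical posts: [category, tuple-of-tokens]
posts = [["category1", ("data", "data", "spam")],
         ["category2", ("data", "eggs", "spam")]]

# Counters support +, so per-post counts sum into one total;
# Counter() is the start value for sum()
total = sum((Counter(tokens) for cat, tokens in posts), Counter())
print(total.most_common(2))
```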
Upvotes: 3