Reputation: 1793
I am working on an application that requires me to extract keywords (and finally generate a tag cloud of these words) from a stream of conversations. I am considering the following steps:
Up to this point, nltk provides all the tools I need. After this, however, I need to somehow "rank" these words and come up with the most important ones. Can anyone suggest which tools from nltk might be used for this?
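For reference, a minimal sketch of the preprocessing this implies, assuming nltk's punkt tokenizers and the Porter stemmer (both are assumptions, not something stated in the question); the result matches the stemmed_sentences structure used in the answer below:
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download('punkt')  # the tokenizers need this data package once
stemmer = PorterStemmer()
conversation = "Some example text pulled from the conversation stream."

stemmed_sentences = []
for sentence in sent_tokenize(conversation):
    # Lowercase, keep alphabetic tokens only, and stem each word
    stemmed_sentences.append(
        [stemmer.stem(w.lower()) for w in word_tokenize(sentence) if w.isalpha()]
    )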
Thanks Nihit
Upvotes: 2
Views: 3477
Reputation: 868
I guess it depends on your definition of "important". If you are talking about frequency, then you can just build a dictionary with the words (or stems) as keys and their counts as values. Afterwards, you can sort the entries by count.
Something like (not tested):
from collections import defaultdict

# Collect word statistics
counts = defaultdict(int)
for sent in stemmed_sentences:
    for stem in sent:
        counts[stem] += 1

# Drop all stems with count < 3; they are not relevant and sorting will be faster
pairs = [(x, y) for x, y in counts.items() if y >= 3]

# Sort (stem, count) pairs by count, most frequent first
sorted_stems = sorted(pairs, key=lambda x: x[1], reverse=True)
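If you only need the top stems for the tag cloud, collections.Counter does the counting and ranking in one step; the cut-off of 3 and the limit of 20 below are arbitrary choices, not part of the answer above:
from collections import Counter

# Count every stem across all sentences
counts = Counter(stem for sent in stemmed_sentences for stem in sent)

# Keep the 20 most common stems, dropping anything seen fewer than 3 times
top_stems = [(stem, n) for stem, n in counts.most_common(20) if n >= 3]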
Upvotes: 3