Reputation: 596
I'm not very experienced with Python, but I want to do some data analytics on a corpus, so I'm doing that part with NLTK in Python.
I want to go through the entire corpus and build a dictionary containing every word that appears in it. I then want to be able to look up a word in this dictionary and see how many times it appeared as each part of speech (tag). For example, if I searched for 'dog' I might find 100 noun tags and 5 verb tags, etc.
The final goal is to save this data externally as a .txt file (or similar) and load it in another program to check the probability of a word having a given tag.
Would I do this with Counter and ngrams?
Upvotes: 1
Views: 4821
Reputation: 50200
Since you just want the POS counts of individual words, you don't need ngrams; you need a tagged corpus. Assuming your corpus is already tagged, you can do it like this:
>>> import nltk
>>> from nltk.corpus import brown
>>> wordcounts = nltk.ConditionalFreqDist(brown.tagged_words())
>>> wordcounts["set"].tabulate(10)
VBN VB NN VBD VBN-HL NN-HL
159 88 86 71 2 2
A ConditionalFreqDist is basically a dictionary of Counter objects, with some extras thrown in. Look it up in the NLTK docs.
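For the saving/reloading part of the question, here is a minimal sketch of one way to do it (the tab-separated layout and the filename word_tag_counts.txt are just assumptions, not anything NLTK prescribes): write out every (word, tag, count) triple, then reload it elsewhere and turn the counts into relative frequencies to estimate the probability of a tag given a word.

import nltk
from nltk.corpus import brown

wordcounts = nltk.ConditionalFreqDist(brown.tagged_words())

# Dump every (word, tag, count) triple to a tab-separated text file.
with open("word_tag_counts.txt", "w", encoding="utf-8") as out:
    for word in wordcounts.conditions():
        for tag, count in wordcounts[word].items():
            out.write("{}\t{}\t{}\n".format(word, tag, count))

The other program doesn't need NLTK at all to read it back:

from collections import defaultdict, Counter

# Rebuild the word -> tag counts from the text file.
counts = defaultdict(Counter)
with open("word_tag_counts.txt", encoding="utf-8") as f:
    for line in f:
        word, tag, count = line.rstrip("\n").split("\t")
        counts[word][tag] = int(count)

# Relative frequency of each tag for a given word.
total = sum(counts["set"].values())
for tag, n in counts["set"].most_common():
    print(tag, n / total)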
PS. If you want to case-normalize your words before counting, use
wordcounts = nltk.ConditionalFreqDist((w.lower(), t) for w, t in brown.tagged_words())
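If you do that, remember to lowercase the query too when you look a word up, e.g.:

query = "Dog"
wordcounts[query.lower()].tabulate(10)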
Upvotes: 2