Reputation: 2185
I'm trying to write a counter that takes a list of POS trigrams, checks it against a larger list of trigrams, and finds their frequency. My code so far is as follows:
from nltk import trigrams
from nltk.tokenize import wordpunct_tokenize
from nltk import bigrams
from collections import Counter
import nltk

text = ["This is an example sentence."]
trigram_top = ['PRP', 'MD', 'VB']

for words in text:
    tokens = wordpunct_tokenize(words)
    tags = nltk.pos_tag(tokens)
    trigram_list = trigrams(tags)
    list_tri = Counter(t for t in trigram_list if t in trigram_top)
    print(list_tri)
I get an empty counter back. How do I fix this? In an earlier version I did get data back, but the counts kept accumulating on every iteration (in the real program, text is a collection of different files). Does anyone have an idea?
Upvotes: 1
Views: 254
Reputation: 142156
Let's put some print statements in there to debug:
from nltk import trigrams
from nltk.tokenize import wordpunct_tokenize
from nltk import bigrams
from collections import Counter
import nltk

text = ["This is an example sentence."]
trigram_top = ['PRP', 'MD', 'VB']

for words in text:
    tokens = wordpunct_tokenize(words)
    print(tokens)
    tags = nltk.pos_tag(tokens)
    print(tags)
    list_tri = Counter(t[0] for t in tags if t[1] in trigram_top)
    print(list_tri)

# ['This', 'is', 'an', 'example', 'sentence', '.']
# [('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'), ('sentence', 'NN'), ('.', '.')]
# Counter()
Note that the list= part was redundant, and I've changed the generator to take just the word instead of the POS tag.
We can see that none of the POS tags directly match an entry in your trigram_top. You may want to amend your comparison check to cater for VB/VBZ and similar variants.
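To see why the original test (t in trigram_top) could never match, it may help to look at what one trigram of tagged tokens actually contains. This is a minimal sketch that hard-codes the tagged output shown above so it runs without NLTK; the zip-based window is only a stand-in for nltk.trigrams, used here for illustration.

```python
# Tagged tokens copied from the debug output above (hard-coded so no NLTK models are needed).
tags = [('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'),
        ('example', 'NN'), ('sentence', 'NN'), ('.', '.')]
trigram_top = ['PRP', 'MD', 'VB']

# Sliding window of three consecutive items; stands in for nltk.trigrams(tags).
trigram_list = list(zip(tags, tags[1:], tags[2:]))

first = trigram_list[0]
print(first)  # a tuple of three (word, tag) pairs

# A tuple of pairs never equals a plain tag string, so this membership test is False.
print(first in trigram_top)

# Pulling out just the tags gives something that can be compared to a tag pattern.
first_tags = tuple(tag for _, tag in first)
print(first_tags)
```

Each trigram is a tuple of three (word, tag) pairs, while trigram_top holds plain strings, so the comparison has to be made on the tags alone.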
A possibility would be changing the line:
list_tri = Counter(t[0] for t in tags if t[1].startswith(tuple(trigram_top)))
# Counter({'is': 1})
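Since the question is about POS trigrams rather than single tags, the same prefix idea can be applied to three consecutive tags at a time. The sketch below assumes each pattern is a tuple of three tag prefixes, which is one possible reading of trigram_top. The helper names (tag_trigrams, matches, count_patterns) and the pre-tagged sample texts are hypothetical; in the real program the tags would come from nltk.pos_tag. Building a new Counter for each text is one way to stop counts from carrying over between iterations.

```python
from collections import Counter

def tag_trigrams(tagged):
    """Return tuples of three consecutive POS tags from (word, tag) pairs."""
    tag_seq = [tag for _, tag in tagged]
    return zip(tag_seq, tag_seq[1:], tag_seq[2:])

def matches(trigram, pattern):
    """True if each tag starts with the corresponding pattern prefix (VB matches VBZ)."""
    return all(tag.startswith(prefix) for tag, prefix in zip(trigram, pattern))

def count_patterns(tagged, patterns):
    """Count how often each pattern occurs in one tagged text."""
    counts = Counter()
    for tri in tag_trigrams(tagged):
        for pattern in patterns:
            if matches(tri, pattern):
                counts[pattern] += 1
    return counts

# Hypothetical pre-tagged texts, hard-coded so the sketch runs without NLTK models.
tagged_texts = [
    [('I', 'PRP'), ('can', 'MD'), ('go', 'VB'), ('.', '.')],
    [('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'),
     ('example', 'NN'), ('sentence', 'NN'), ('.', '.')],
]
# Assumed pattern format: one tuple of three tag prefixes per trigram of interest.
patterns = [('PRP', 'MD', 'VB'), ('DT', 'VB', 'DT')]

# A fresh Counter per text keeps the counts of different files separate.
per_text = [count_patterns(tagged, patterns) for tagged in tagged_texts]
for counts in per_text:
    print(counts)
```

Be aware that prefix matching is deliberately loose: VB also matches VBZ, VBD and so on, and PRP also matches PRP$. If exact tags are needed, compare with == instead of startswith.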
Upvotes: 2