Reputation: 951
I am writing code for a baseline tagger. Based on the Brown corpus, it assigns the most common tag to each word. So if the word "works" is tagged as a verb 23 times and as a plural noun 30 times, then in the user's input sentence it would be tagged as a plural noun. If the word is not found in the corpus, it is tagged as a noun by default. The code I have so far returns every tag for the word, not just the most frequent one. How can I get it to return only the most frequent tag per word?
import nltk
from nltk.corpus import brown

def findtags(userinput, tagged_text):
    uinput = userinput.split()
    fdist = nltk.FreqDist(tagged_text)
    result = []
    for item in fdist.items():
        for u in uinput:
            if u == item[0][0]:
                t = (u, item[0][1])
                result.append(t)
                continue
            t = (u, "NN")
            result.append(t)
    return result

def main():
    tags = findtags("the quick brown fox", brown.tagged_words())
    print tags

if __name__ == '__main__':
    main()
Upvotes: 3
Views: 2204
Reputation: 122280
If it's English, there is a default POS tagger in NLTK that a lot of people have complained about, but it's a nice quick fix (more a band-aid than paracetamol); see POS tagging - NLTK thinks noun is adjective:
>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> sent = "the quick brown fox"
>>> pos_tag(word_tokenize(sent))
[('the', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN')]
If you want to train a baseline tagger from scratch, I recommend you follow an example like this but change the corpus to an English one: https://github.com/alvations/spaghetti-tagger
By building a UnigramTagger as in spaghetti-tagger, you should automatically get the most common tag for every word.
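For example, a minimal sketch of that approach, training on Brown with the question's "NN" default as a backoff for unseen words (training on the full corpus takes a moment):

import nltk
from nltk.corpus import brown

# Unseen words fall back to 'NN', matching the question's default.
default = nltk.DefaultTagger('NN')

# UnigramTagger memorises the most frequent tag per word
# seen in the training sentences.
tagger = nltk.UnigramTagger(brown.tagged_sents(), backoff=default)

print tagger.tag("the quick brown fox".split())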
However, if you want to do it the non-machine-learning way and first count word:POS pairs yourself, what you'll need is some sort of type-token mapping; also see Part-of-speech tag without context using nltk:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from collections import Counter, defaultdict
from itertools import chain

def type_token_ratio(documentstream):
    ttr = defaultdict(list)
    for token, pos in list(chain(*documentstream)):
        ttr[token].append(pos)
    return ttr

def most_freq_tag(ttr, word):
    return Counter(ttr[word]).most_common()[0][0]

sent1 = "the quick brown fox quick me with a quick ."
sent2 = "the brown quick fox fox me with a brown ."
documents = [sent1, sent2]

# Calculates the TTR.
documents_ttr = type_token_ratio([pos_tag(word_tokenize(i)) for i in documents])

# Best tag for the word.
print Counter(documents_ttr['quick']).most_common()[0]

# Best tags for a sentence
print [most_freq_tag(documents_ttr, i) for i in sent1.split()]
NOTE: A document stream can be defined as a list of sentences where each sentence contains a list of tokens, with or without tags.
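For instance, a tiny hypothetical tagged document stream could look like this:

documents = [[('the', 'DT'), ('quick', 'JJ')],
             [('brown', 'JJ'), ('fox', 'NN')]]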
Upvotes: 3
Reputation: 4318
You can simply use Counter to find the most repeated item in a list:
from collections import Counter
default_tag = Counter(tags).most_common(1)[0][0]
If your question is "how does a unigram tagger work?", you might be interested in reading the NLTK source code: http://nltk.org/_modules/nltk/tag/sequential.html#UnigramTagger
Anyway, I suggest you read NLTK book chapter 5, especially: http://nltk.org/book/ch05.html#the-lookup-tagger
Just like the sample in the book, you can build a conditional frequency distribution that returns the best tag for each given word.
cfd = nltk.ConditionalFreqDist(nltk.corpus.brown.tagged_words())
In this case cfd["fox"].max()
will return the most likely tag for "fox" according to brown corpus. Then you can make a dictionary of most likely tags for each word of your sentence:
likely_tags = dict((word, cfd[word].max()) for word in "the quick brown fox".split())
Notice that this will raise an error for words that are new (unseen in the corpus). But if you understand the idea, you can make your own tagger.
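One way around that, as a sketch reusing the cfd built above and the question's "NN" default for unknown words ("foxify" is just a made-up unseen word):

def likely_tag(word):
    # Fall back to 'NN' when the word never occurs in Brown.
    return cfd[word].max() if word in cfd else 'NN'

print [(w, likely_tag(w)) for w in "the quick brown foxify".split()]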
Upvotes: 0
Reputation: 10139
Create a dictionary called word_tags whose key is a word (unannotated) and whose value is a list of tags in descending frequency (based on your fdist).
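A sketch of building that dictionary from the question's fdist, whose keys are (word, tag) pairs and whose values are counts (pairs_per_word is just an illustrative helper name):

from collections import defaultdict

pairs_per_word = defaultdict(list)
for (word, tag), count in fdist.items():
    pairs_per_word[word].append((count, tag))

# For each word, keep its tags sorted by descending count.
word_tags = dict((word, [tag for count, tag in sorted(pairs, reverse=True)])
                 for word, pairs in pairs_per_word.items())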
Then:
for u in uinput:
    result.append(word_tags[u][0])
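Note that word_tags[u] raises a KeyError for words never seen in the corpus, so you may want word_tags.get(u, ['NN'])[0] to keep the question's noun default.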
Upvotes: 0