Reputation: 491
When I print nltk.corpus.brown.tagged_words(), it prints about 1161192 tuples of words and their associated tags.
I want to distinguish the distinct words that carry distinct tags; one word can have more than one tag.
I tried every code snippet from this thread, but I am not getting any word with more than 3 tags. As far as I know, there are words with even 8 or 9 tags.
Where is my approach wrong? How do I resolve this? I have two different questions:
How do I count the distinct words of the corpus under different numbers of distinct tags? For example, the number of distinct words in the corpus having, say, 8 distinct tags.
Also, I want to know the word with the greatest number of distinct tags.
I am interested in words only, so I am removing punctuation.
Upvotes: 3
Views: 9531
Reputation: 50220
The NLTK provides the perfect tool to index all tags used for each word:
wordtags = nltk.ConditionalFreqDist(nltk.corpus.brown.tagged_words())
Or if you want to case-fold the words as you go:
wordtags = nltk.ConditionalFreqDist((w.lower(), t) for w, t in nltk.corpus.brown.tagged_words())
We now have an index of the tags belonging to each word (plus their frequencies, which the OP didn't care about):
>>> print(wordtags["clean"].items())
dict_items([('JJ', 48), ('NN-TL', 1), ('RB', 1), ('VB-HL', 1), ('VB', 18)])
To find the words with the most tags, fall back on general Python sorting:
>>> wtlist = sorted(wordtags.items(), key=lambda x: len(x[1]), reverse=True)
>>> for word, freqs in wtlist[:10]:
...     print(word, "\t", len(freqs), list(freqs))
that 15 ['DT', 'WPS-TL', 'CS-NC', 'DT-NC', 'WPS-NC', 'WPS', 'NIL', 'CS-HL', 'WPS-HL',
'WPO-NC', 'DT-TL', 'DT-HL', 'CS', 'QL', 'WPO']
a 13 ['NN-TL', 'AT-NC', 'NP', 'AT', 'AT-TL-HL', 'NP-HL', 'NIL', 'AT-TL', 'NN',
'NP-TL', 'AT-HL', 'FW-IN-TL', 'FW-IN']
(etc.)
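The OP's first question (how many distinct words have, say, 8 distinct tags) is then one more comprehension over wordtags. A minimal sketch on a toy tagged list standing in for brown.tagged_words(), so the data and threshold here are illustrative only:

```python
import nltk

# Toy stand-in for nltk.corpus.brown.tagged_words()
tagged = [("clean", "JJ"), ("clean", "VB"), ("the", "AT"),
          ("run", "NN"), ("run", "VB"), ("run", "JJ")]

wordtags = nltk.ConditionalFreqDist(tagged)

# Words with exactly 2 distinct tags (use 8 on the real corpus)
two_tag_words = [w for w in wordtags if len(wordtags[w]) == 2]
print(len(two_tag_words), two_tag_words)
```

On the real corpus, replace tagged with brown.tagged_words() and 2 with 8.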
Upvotes: 1
Reputation: 3836
A two-line way to find the word with the greatest number of distinct tags (along with its tags):
word2tags = nltk.Index(set(nltk.corpus.brown.tagged_words()))
print(max(word2tags.items(), key=lambda wt: len(wt[1])))
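Since the OP wants words only, the pairs can be filtered before building the index. A small sketch on a toy tagged list rather than the full corpus; str.isalpha() is assumed here as the punctuation test:

```python
import nltk

# Toy stand-in for nltk.corpus.brown.tagged_words()
tagged = [("that", "CS"), ("that", "DT"), (".", "."), (",", ",")]

# Keep alphabetic tokens only, then index the tags by word
word2tags = nltk.Index((w, t) for w, t in set(tagged) if w.isalpha())
print(max(word2tags.items(), key=lambda wt: len(wt[1])))
```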
Upvotes: 0
Reputation: 122168
Use a defaultdict(Counter) to keep track of words and their POS tags, then sort the dictionary by each key's len(Counter):
from collections import defaultdict, Counter
from nltk.corpus import brown

# Keep words and POS tags in a dictionary where the key is a word
# and the value is a Counter of POS tags and their counts
word_tags = defaultdict(Counter)
for word, pos in brown.tagged_words():
    word_tags[word][pos] += 1

# To access the POS counter:
print('Red', word_tags['Red'])
print('Marlowe', word_tags['Marlowe'])
print()

# Greatest number of distinct tags:
word_with_most_distinct_pos = sorted(word_tags, key=lambda x: len(word_tags[x]), reverse=True)[0]
print(word_with_most_distinct_pos)
print(word_tags[word_with_most_distinct_pos])
print(len(word_tags[word_with_most_distinct_pos]))
[out]:
Red Counter({'JJ-TL': 49, 'NP': 21, 'JJ': 3, 'NN-TL': 1, 'JJ-TL-HL': 1})
Marlowe Counter({'NP': 4})
that
Counter({'CS': 6419, 'DT': 1975, 'WPS': 1638, 'WPO': 135, 'QL': 54, 'DT-NC': 6, 'WPS-NC': 3, 'CS-NC': 2, 'WPS-HL': 2, 'NIL': 1, 'CS-HL': 1, 'WPO-NC': 1})
12
To get the words with exactly X distinct POS tags:
# Words with 8 distinct POS tags
word_with_eight_pos = [w for w in word_tags if len(word_tags[w]) == 8]
for w in word_with_eight_pos:
    print(w, word_tags[w])
print()

# Words with 9 distinct POS tags
word_with_nine_pos = [w for w in word_tags if len(word_tags[w]) == 9]
for w in word_with_nine_pos:
    print(w, word_tags[w])
[out]:
a Counter({'AT': 21824, 'AT-HL': 40, 'AT-NC': 7, 'FW-IN': 4, 'NIL': 3, 'FW-IN-TL': 1, 'AT-TL': 1, 'NN': 1})
: Counter({':': 1558, ':-HL': 138, '.': 46, ':-TL': 22, 'IN': 20, '.-HL': 8, 'NIL': 1, ',': 1, 'NP': 1})
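Note that ':' shows up among the 9-tag "words" above; since the OP is removing punctuation, the counting loop can simply skip non-alphabetic tokens. A stdlib-only sketch on a toy tagged list (isalpha() is an assumption about what counts as a word):

```python
from collections import defaultdict, Counter

# Toy stand-in for brown.tagged_words(), with punctuation mixed in
tagged = [("a", "AT"), ("a", "NN"), (":", ":"), (":", ".")]

word_tags = defaultdict(Counter)
for word, pos in tagged:
    if word.isalpha():          # drop punctuation tokens
        word_tags[word][pos] += 1

print(dict(word_tags))
```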
Upvotes: 6
Reputation: 5971
You can use itertools.groupby to achieve what you want. Do note that the following code was just quickly bashed together and is most likely not the most efficient way to reach your goal (I'll leave it to you to optimise), but it does the job...
import itertools
import operator
import nltk
for k, g in itertools.groupby(sorted(nltk.corpus.brown.tagged_words()), key=operator.itemgetter(0)):
    print(k, set(map(operator.itemgetter(1), g)))
Output:
...
yonder {'RB'}
yongst {'JJT'}
yore {'NN', 'PP$'}
yori {'FW-NNS'}
you {'PPSS-NC', 'PPO', 'PPSS', 'PPO-NC', 'PPO-HL', 'PPSS-HL'}
you'd {'PPSS+HVD', 'PPSS+MD'}
you'll {'PPSS+MD'}
you're {'PPSS+BER'}
...
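One detail worth flagging: the sorted() call above is load-bearing, because itertools.groupby only merges adjacent items; without it, a word's tags get split across several groups. A stdlib-only illustration:

```python
import itertools
import operator

pairs = [("you", "PPO"), ("that", "CS"), ("you", "PPSS")]

# Unsorted input: "you" is emitted twice, in two separate groups
unsorted_keys = [k for k, _ in itertools.groupby(pairs, key=operator.itemgetter(0))]
print(unsorted_keys)  # ['you', 'that', 'you']

# Sorted input: one group per word, as in the answer above
sorted_keys = [k for k, _ in itertools.groupby(sorted(pairs), key=operator.itemgetter(0))]
print(sorted_keys)  # ['that', 'you']
```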
Upvotes: 0