Reputation: 111
I am currently having trouble with this.
I was given the task of implementing a function that returns a sorted list of distinct words with a given part of speech. I am required to use NLTK's pos_tag_sents and NLTK's tokeniser to count the specific words.
I had a similar question before and got it working thanks to help from other Stack Overflow users, and I am trying to use the same method to solve this problem.
Here is what I have so far in my code:
import nltk
import collections
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('brown')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
def pos_counts(text, pos_list):
    """Return the sorted list of distinct words with a given part of speech
    >>> emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
    >>> pos_counts(emma, ['DET', 'NOUN'])
    [14352, 32029] - expected result
    """
    text = nltk.word_tokenize(text)
    tempword = nltk.pos_tag_sents(text, tagset="universal")
    counts = nltk.FreqDist(tempword)
    return [counts[x] or 0 for x in pos_list]
There is a doctest that should give the result [14352, 32029].
I ran my code and got this error message:
Error
**********************************************************************
File "C:/Users/PycharmProjects/a1/a1.py", line 29, in a1.pos_counts
Failed example:
pos_counts(emma, ['DET', 'NOUN'])
Exception raised:
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.4\helpers\pycharm\docrunner.py", line 140, in __run
compileflags, 1), test.globs)
File "<doctest a1.pos_counts[1]>", line 1, in <module>
pos_counts(emma, ['DET', 'NOUN'])
File "C:/Users/PycharmProjects/a1/a1.py", line 35, in pos_counts
counts = nltk.FreqDist(tempword)
File "C:\Users\PycharmProjects\a1\venv\lib\site-packages\nltk\probability.py", line 108, in __init__
Counter.__init__(self, samples)
File "C:\Users\AppData\Local\Programs\Python\Python36-32\lib\collections\__init__.py", line 535, in __init__
self.update(*args, **kwds)
File "C:\Users\PycharmProjects\a1\venv\lib\site-packages\nltk\probability.py", line 146, in update
super(FreqDist, self).update(*args, **kwargs)
File "C:\Users\AppData\Local\Programs\Python\Python36-32\lib\collections\__init__.py", line 622, in update
_count_elements(self, iterable)
TypeError: unhashable type: 'list'
I feel I'm getting close, but I don't know what I'm doing wrong.
Any help would be much appreciated. Thank you.
Upvotes: 0
Views: 2055
Reputation: 1109
One way to do it would be like this:
import nltk
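def pos_count(text, pos_list):
    # Split the raw text into sentences, then each sentence into tokens
    sents = nltk.tokenize.sent_tokenize(text)
    words = (nltk.word_tokenize(sent) for sent in sents)
    # Tag every sentence; each result is a list of (word, tag) pairs
    tagged = nltk.pos_tag_sents(words, tagset='universal')
    # Flatten to a single list of tags
    tags = [tag[1] for sent in tagged for tag in sent]
    # Count only the tags that were asked for
    counts = nltk.FreqDist(tag for tag in tags if tag in pos_list)
    return counts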
This is all explained well in the NLTK book. Test:
In [3]: emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
In [4]: pos_count(emma, ['DET', 'NOUN'])
Out[4]: FreqDist({'DET': 14352, 'NOUN': 32029})
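As for the TypeError in the question: nltk.word_tokenize(text) returns one flat list of tokens, while pos_tag_sents expects a list of tokenised sentences and returns one list of (word, tag) pairs per sentence. FreqDist is a Counter subclass (you can see this in the traceback), and a Counter can only count hashable items, so passing it a list of lists fails. A minimal sketch of that failure, using a plain collections.Counter and a small hand-written sample in place of real tagger output (the sample data is made up for illustration):

```python
from collections import Counter

# Hand-written stand-in for tagger output: one list of (word, tag) pairs per sentence
nested = [
    [("The", "DET"), ("dog", "NOUN")],
    [("Birds", "NOUN")],
]

# Counting the outer list directly fails, because each item is itself a list
try:
    Counter(nested)
    message = None
except TypeError as exc:
    message = str(exc)
print(message)

# Flattening to hashable items (here, the tag strings) works
flat_tags = [tag for sent in nested for _word, tag in sent]
tag_counts = Counter(flat_tags)
print(tag_counts["NOUN"], tag_counts["DET"])
```

This is why the function above flattens the tagged sentences into a list of tags before counting.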
EDIT: It's a good idea to use FreqDist when you need to count things such as part-of-speech tags. I don't think it's a good idea for a function to return a plain list of results: how would you know which number represents which tag?
A possible (imho bad) solution is to return the FreqDist values sorted by tag name, so that the results follow the alphabetical order of the tags. If you really want this, replace return counts with return [item[1] for item in sorted(counts.items())] in the definition of the function above.
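Note that alphabetical order only happens to match ['DET', 'NOUN']; with a different pos_list the numbers would no longer line up with the tags you asked for, and a tag that never occurs would be missing from the list altogether. A rough sketch of an alternative that reports the counts in the order of pos_list, with 0 for absent tags. The helper name counts_in_order and the sample data are hypothetical, and a plain Counter stands in for FreqDist so the snippet runs without any NLTK downloads:

```python
from collections import Counter

def counts_in_order(tagged_sents, pos_list):
    """Count tags in tagged sentences and report them in the order of pos_list."""
    wanted = set(pos_list)
    counts = Counter(
        tag for sent in tagged_sents for _word, tag in sent if tag in wanted
    )
    # A Counter returns 0 for missing keys, so absent tags show up as 0
    return [counts[tag] for tag in pos_list]

# Hand-written sample in the shape that nltk.pos_tag_sents returns
sample = [
    [("The", "DET"), ("dog", "NOUN"), ("chased", "VERB"), ("the", "DET"), ("cat", "NOUN")],
    [("Birds", "NOUN"), ("sing", "VERB")],
]

print(counts_in_order(sample, ["NOUN", "DET"]))  # [3, 2]
print(counts_in_order(sample, ["ADJ", "DET"]))   # [0, 2]
```

In the function above, the same idea would amount to replacing return counts with return [counts[tag] for tag in pos_list].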
Upvotes: 2