Reputation: 3
I'm currently working on lemmatizing a sentence while also applying POS tags. This is what I have so far:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
lem = WordNetLemmatizer()
def findTag(sentence):
    sentence = word_tokenize(sentence)
    sentence = [i.strip(" ") for i in sentence]
    pos_label = nltk.pos_tag(sentence)[0][1][0].lower()
    if pos_label == "j":
        pos_label == "a"
    if pos_label in ["a", "n", "v"]:
        print(lem.lemmatize(word, pos = pos_label))
    elif pos_label in ['r']:
        print(wordnet.synset(sentence+".r.1").lemmas()[0].pertainyms()[0].name())
    else:
        print(lem.lemmatize(sentence))
findTag("I love running angrily")
However, when I input a sentence, I get this error:
Traceback (most recent call last):
  File "spoilerDetect.py", line 25, in <module>
    findTag("I love running angrily")
  File "spoilerDetect.py", line 22, in findTag
    print(lem.lemmatize(sentence))
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/nltk/stem/wordnet.py", line 41, in lemmatize
    lemmas = wordnet._morphy(word, pos)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/nltk/corpus/reader/wordnet.py", line 1905, in _morphy
    if form in exceptions:
TypeError: unhashable type: 'list'
I understand that lists are unhashable, but I'm unsure of how to fix this. Do I change the list to a tuple, or is there something I'm not understanding?
Upvotes: 0
Views: 1095
Reputation: 122280
Let's walk through the code and see how to get your desired output.
First, the imports. You have:
import nltk
from nltk import pos_tag
and then you were using
pos_label = nltk.pos_tag(...)
Since you're already doing from nltk import pos_tag, pos_tag is already in the global namespace, so just do:
pos_label = pos_tag(...)
Idiomatically, the imports should be cleaned up a little to look like this:
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
Next, keeping the list of tokenized words, then the list of POS tags, then the list of lemmas in separate variables sounds logical, but since the function only prints the result at the end, you can chain pos_tag(word_tokenize(...)) and iterate through it to retrieve the POS tag and token for each word, i.e.:
sentence = "I love running angrily"
for word, pos in pos_tag(word_tokenize(sentence)):
    print(word, '|', pos)
[out]:
I | PRP
love | VBP
running | VBG
angrily | RB
Now, we know that there's a mismatch between the output of pos_tag and the POS tags that the WordNetLemmatizer expects. From https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L124, there is a function called penn2morphy that looks like this:
def penn2morphy(penntag, returnNone=False, default_to_noun=False) -> str:
"""
Converts tags from Penn format (input: single string) to Morphy.
"""
morphy_tag = {'NN':'n', 'JJ':'a', 'VB':'v', 'RB':'r'}
try:
return morphy_tag[penntag[:2]]
except:
if returnNone:
return None
elif default_to_noun:
return 'n'
else:
return ''
An example:
>>> penn2morphy('JJ')
'a'
>>> penn2morphy('PRP')
''
If we use these converted tags as input to the WordNetLemmatizer and reuse your if-else conditions:
sentence = "I love running angrily"
for token, pos in pos_tag(word_tokenize(sentence)):
    morphy_pos = penn2morphy(pos)
    if morphy_pos in ["a", "n", "v"]:
        print(wnl.lemmatize(token, pos=morphy_pos))
    elif morphy_pos in ['r']:
        print(wn.synset(token+".r.1").lemmas()[0].pertainyms()[0].name())
    else:
        print(wnl.lemmatize(token))
[out]:
I
love
run
angry
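If you'd rather have this as a reusable function that returns the lemmas instead of printing them, here's a minimal sketch (lemmatize_sentence is just an illustrative name, and it assumes the penn2morphy function above is defined):
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def lemmatize_sentence(sentence):
    # assumes penn2morphy from above is already defined
    lemmas = []
    for token, pos in pos_tag(word_tokenize(sentence)):
        morphy_pos = penn2morphy(pos)
        if morphy_pos in ("a", "n", "v"):
            lemmas.append(wnl.lemmatize(token, pos=morphy_pos))
        elif morphy_pos == "r":
            # adverb -> base adjective via pertainyms, as in your original code;
            # note this raises WordNetError for adverbs without an .r.1 synset
            lemmas.append(wn.synset(token + ".r.1").lemmas()[0].pertainyms()[0].name())
        else:
            lemmas.append(wnl.lemmatize(token))
    return lemmas

print(lemmatize_sentence("I love running angrily"))
[out]:
['I', 'love', 'run', 'angry']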
Hey, what did you do there? Your code works but mine doesn't!
Okay, now that we know how to get the desired output, let's recap.
But how is it that my code doesn't work?!
Okay, let's work through your code to see why you're getting the error.
First, let's check every output you get within the findTag function, printing the type of the output and the output itself:
sentence = "I love running angrily"
sentence = word_tokenize(sentence)
print(type(sentence))
print(sentence)
[out]:
<class 'list'>
['I', 'love', 'running', 'angrily']
At sentence = word_tokenize(sentence), you have already overwritten your original variable with the function's output. That's usually a sign of errors later on =)
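A safer habit is to keep the tokenized result under a new name so the original string stays intact, e.g.:
sentence = "I love running angrily"
tokens = word_tokenize(sentence)  # sentence is still the original string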
Now let's look at the next line:
sentence = "I love running angrily"
sentence = word_tokenize(sentence)
sentence = [i.strip(" ") for i in sentence]
print(type(sentence))
print(sentence)
[out]:
<class 'list'>
['I', 'love', 'running', 'angrily']
Now we see that sentence = [i.strip(" ") for i in sentence] is actually meaningless given the example sentence.
Q: But is it true that all tokens output by word_tokenize have no leading/trailing spaces, which is what i.strip(' ') is trying to remove?
A: Yes, it seems so. NLTK first performs the regex operations on the string, then calls str.split(), which leaves no leading/trailing spaces on the tokens; see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/destructive.py#L141
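You can convince yourself with a quick sanity check (not part of the original code):
from nltk import word_tokenize

tokens = word_tokenize("  I   love    running angrily  ")
print(tokens)
print(all(tok == tok.strip(" ") for tok in tokens))
[out]:
['I', 'love', 'running', 'angrily']
True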
Let's continue:
sentence = "I love running angrily"
sentence = word_tokenize(sentence)
sentence = [i.strip(" ") for i in sentence]
pos_label = nltk.pos_tag(sentence)[0][1][0].lower()
print(type(pos_label))
print(pos_label)
[out]:
<class 'str'>
p
Q: Wait a minute, why is pos_label only a single character?
Q: And what is POS tag p?
A: Let's look closer at what's happening in nltk.pos_tag(sentence)[0][1][0].lower()
Usually, when you have to do this kind of [0][1][0] nested index retrieval, it's error-prone. We need to ask: what is [0][1][0] accessing?
We know that sentence, after sentence = word_tokenize(sentence), has become a list of strings, and pos_tag(sentence) returns a list of tuples of strings, where the first item in each tuple is the token and the second is the POS tag, i.e.:
sentence = "I love running angrily"
sentence = word_tokenize(sentence)
sentence = [i.strip(" ") for i in sentence]
thing = pos_tag(sentence)
print(type(thing))
print(thing)
[out]:
<class 'list'>
[('I', 'PRP'), ('love', 'VBP'), ('running', 'VBG'), ('angrily', 'RB')]
Now that we know thing = pos_tag(word_tokenize("I love running angrily")) outputs the above, let's work with that to see what [0][1][0] is accessing.
>>> thing = [('I', 'PRP'), ('love', 'VBP'), ('running', 'VBG'), ('angrily', 'RB')]
>>> thing[0]
('I', 'PRP')
So thing[0] outputs the (token, pos) tuple for the first token.
>>> thing = [('I', 'PRP'), ('love', 'VBP'), ('running', 'VBG'), ('angrily', 'RB')]
>>> thing[0][1]
'PRP'
And thing[0][1] outputs the POS of the first token.
>>> thing = [('I', 'PRP'), ('love', 'VBP'), ('running', 'VBG'), ('angrily', 'RB')]
>>> thing[0][1][0]
'P'
Ah, so the [0][1][0] retrieves the first character of the POS tag of the first token.
So the question is: is that the desired behavior? If so, why are you only looking at the POS of the first word?
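As an aside, if you really do want just the first token's tag, tuple unpacking reads better than chained indexing, e.g.:
first_token, first_pos = pos_tag(word_tokenize("I love running angrily"))[0]
print(first_token, '|', first_pos)
[out]:
I | PRP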
Regardless of what I'm looking at, your explanation still doesn't tell me why the TypeError: unhashable type: 'list' occurs. Stop distracting me and tell me how to resolve the TypeError!!
Okay, we move on. Now that we know thing = pos_tag(word_tokenize("I love running angrily")) and thing[0][1][0].lower() == 'p', given your if-else conditions,
if pos_label in ["a", "n", "v"]:
    print(lem.lemmatize(word, pos = pos_label))
elif pos_label in ['r']:
    print(wordnet.synset(sentence+".r.1").lemmas()[0].pertainyms()[0].name())
else:
    print(lem.lemmatize(sentence))
we find that the 'p' value would have gone to the else, i.e. print(lem.lemmatize(sentence)). But wait a minute, remember what sentence became after you modified it with:
>>> sentence = word_tokenize("I love running angrily")
>>> sentence = [i.strip(" ") for i in sentence]
>>> sentence
['I', 'love', 'running', 'angrily']
So what happens if we just ignore all the rest of the code and focus on this:
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
sentence = ['I', 'love', 'running', 'angrily']
lem.lemmatize(sentence)
[out]:
-------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-34-497ae98ecaa3> in <module>
4 sentence = ['I', 'love', 'running', 'angrily']
5
----> 6 lem.lemmatize(sentence)
~/Library/Python/3.6/lib/python/site-packages/nltk/stem/wordnet.py in lemmatize(self, word, pos)
39
40 def lemmatize(self, word, pos=NOUN):
---> 41 lemmas = wordnet._morphy(word, pos)
42 return min(lemmas, key=len) if lemmas else word
43
~/Library/Python/3.6/lib/python/site-packages/nltk/corpus/reader/wordnet.py in _morphy(self, form, pos, check_exceptions)
1903 # 0. Check the exception lists
1904 if check_exceptions:
-> 1905 if form in exceptions:
1906 return filter_forms([form] + exceptions[form])
1907
TypeError: unhashable type: 'list'
Ah ha!! That's where the error occurs!!!
It's because the WordNetLemmatizer expects a single string as input and you're passing it a list of strings. Example usage:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
token = 'words'
wnl.lemmatize(token, pos='n')
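And to fix the else branch in your original code, lemmatize each token individually instead of passing the whole list, e.g.:
tokens = ['I', 'love', 'running', 'angrily']
print([wnl.lemmatize(tok) for tok in tokens])
[out]:
['I', 'love', 'running', 'angrily']
(Note that without a POS argument, lemmatize defaults to pos='n', so the verb and adverb are left unchanged here; that's why passing the converted POS tag, as above, matters.)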
Q: Why didn't you just get to the point?!
A: Then you would have missed out on how to debug your code and make it better =)
Upvotes: 1