user12922264

Reputation: 3

Why do I get TypeError: unhashable type when using NLTK lemmatizer on sentence?

I'm currently working on lemmatizing a sentence while also applying pos_tags. This is what I have so far:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

lem = WordNetLemmatizer()

def findTag(sentence):
    sentence = word_tokenize(sentence)
    sentence = [i.strip(" ") for i in sentence]
    pos_label = nltk.pos_tag(sentence)[0][1][0].lower()

    if pos_label == "j":
        pos_label == "a"

    if pos_label in ["a", "n", "v"]:
        print(lem.lemmatize(word, pos = pos_label))
    elif pos_label in ['r']: 
        print(wordnet.synset(sentence+".r.1").lemmas()[0].pertainyms()[0].name())
    else:
        print(lem.lemmatize(sentence))


findTag("I love running angrily")

However, when I run this with a sentence I get the error

Traceback (most recent call last):
  File "spoilerDetect.py", line 25, in <module>
    findTag("I love running angrily")
  File "spoilerDetect.py", line 22, in findTag
    print(lem.lemmatize(sentence))
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/nltk/stem/wordnet.py", line 41, in lemmatize
    lemmas = wordnet._morphy(word, pos)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/nltk/corpus/reader/wordnet.py", line 1905, in _morphy
    if form in exceptions:
TypeError: unhashable type: 'list'

I understand that lists are unhashable, but I'm unsure how to fix this. Do I change the list to a tuple, or is there something I'm not understanding?

Upvotes: 0

Views: 1095

Answers (1)

alvas

Reputation: 122280

Let's walk through the code and see how to get your desired output.

First the imports, you have

import nltk
from nltk import pos_tag

and then you were using

pos_label = nltk.pos_tag(...)

Since you're already using from nltk import pos_tag, pos_tag is already in the global namespace, so just do:

pos_label = pos_tag(...)

Idiomatically, the imports should be cleaned up a little to look like this:

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

Next, keeping the list of tokenized words, then the list of POS tags, and then the list of lemmas as separate variables sounds logical, but since the function ultimately only prints its output, you should be able to chain up pos_tag(word_tokenize(...)) and iterate through it so that you can retrieve the POS tag and token for each word, i.e.

sentence = "I love running angrily"
for word, pos in pos_tag(word_tokenize(sentence)):
    print(word, '|', pos)

[out]:

I | PRP
love | VBP
running | VBG
angrily | RB

Now, we know that there's a mismatch between the outputs of pos_tag and the POS that the WordNetLemmatizer is expecting. From https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L124, there is a function called penn2morphy that looks like this:

def penn2morphy(penntag, returnNone=False, default_to_noun=False) -> str:
    """
    Converts tags from Penn format (input: single string) to Morphy.
    """
    morphy_tag = {'NN':'n', 'JJ':'a', 'VB':'v', 'RB':'r'}
    try:
        return morphy_tag[penntag[:2]]
    except:
        if returnNone:
            return None
        elif default_to_noun:
            return 'n'
        else:
            return ''

An example:

>>> penn2morphy('JJ')
'a'
>>> penn2morphy('PRP')
''
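
And, as an aside, the fallback flags behave as the code above suggests; default_to_noun falls back to 'n' when the tag has no Morphy equivalent:

>>> penn2morphy('PRP', default_to_noun=True)
'n'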

And if we use these converted tags as inputs to the WordNetLemmatizer, reusing your if-else conditions:

sentence = "I love running angrily"
for token, pos in pos_tag(word_tokenize(sentence)):
    morphy_pos = penn2morphy(pos)
    if morphy_pos in ["a", "n", "v"]:
        print(wnl.lemmatize(token, pos=morphy_pos))
    elif morphy_pos in ['r']: 
        print(wn.synset(token+".r.1").lemmas()[0].pertainyms()[0].name())
    else:
        print(wnl.lemmatize(token))

[out]:

I
love
run
angry

Hey, what did you do there? Your code works but mine doesn't!

Okay, now we know how to get the desired output. Let's recap.

  • First, we cleaned up the imports
  • Then, we cleaned up the preprocessing (without keeping intermediate variables)
  • Then, we "functionalized" the conversion of POS tags from Penn -> Morphy
  • Lastly, we applied the same if/else conditions and ran the lemmatizer (a full script is sketched below).
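
Putting those steps together, a minimal end-to-end sketch might look like this (the lemmatize_sentence name is purely illustrative, and it assumes the penn2morphy helper defined above is in scope):

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def lemmatize_sentence(sentence):
    # Chain tokenization and tagging, then handle each (token, pos) pair.
    for token, pos in pos_tag(word_tokenize(sentence)):
        morphy_pos = penn2morphy(pos)  # Penn -> Morphy, defined above
        if morphy_pos in ('a', 'n', 'v'):
            print(wnl.lemmatize(token, pos=morphy_pos))
        elif morphy_pos == 'r':
            # Adverbs: look up the pertainym, as in the original condition.
            # Note: this raises WordNetError if the token has no .r.1 synset.
            print(wn.synset(token + '.r.1').lemmas()[0].pertainyms()[0].name())
        else:
            print(wnl.lemmatize(token))

lemmatize_sentence("I love running angrily")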

But how is it that my code doesn't work?!

Okay, let's work through your code to see why you're getting the error.

First, let's check every output you get within the findTag function, printing both the type of the output and the output itself:

sentence = "I love running angrily"
sentence = word_tokenize(sentence)
print(type(sentence))
print(sentence)

[out]:

<class 'list'>
['I', 'love', 'running', 'angrily']

At sentence = word_tokenize(sentence), you have already overwritten your original variable with the output of the function; usually that's a sign of errors later on =)
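
A small habit that avoids this class of bug is to give each transformation its own name; the variable names here are just illustrative:

text = "I love running angrily"
tokens = word_tokenize(text)   # the raw string is still available in `text`
tagged = pos_tag(tokens)       # list of (token, pos) tuples

print(tokens)
print(tagged)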

Now let's look at the next line:

sentence = "I love running angrily"
sentence = word_tokenize(sentence)
sentence = [i.strip(" ") for i in sentence]

print(type(sentence))
print(sentence)

[out]:

<class 'list'>
['I', 'love', 'running', 'angrily']

Now we see that sentence = [i.strip(" ") for i in sentence] is actually a no-op given the example sentence.

Q: But is it true that all tokens output by word_tokenize have no heading/trailing spaces, which is what i.strip(' ') is trying to remove?

A: Yes, it seems so. NLTK first performs its regex operations on the string, then calls str.split(), which would already have removed heading/trailing spaces around the tokens; see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/destructive.py#L141
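
You can sanity-check this with a string that has extra spaces; the tokens come out clean either way:

>>> from nltk import word_tokenize
>>> word_tokenize("I  love   running")
['I', 'love', 'running']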

Let's continue:

sentence = "I love running angrily"
sentence = word_tokenize(sentence)
sentence = [i.strip(" ") for i in sentence]
pos_label = nltk.pos_tag(sentence)[0][1][0].lower()

print(type(pos_label))
print(pos_label)

[out]:

<class 'str'>
p

Q: Wait a minute, why is pos_label only a single character?

Q: And what is POS tag p?

A: Let's look closer at what's happening in nltk.pos_tag(sentence)[0][1][0].lower()

Usually, when you have to do this kind of nested [0][1][0] index retrieval, it's error-prone. We need to ask: what is [0][1][0] accessing?

We know that sentence, after sentence = word_tokenize(sentence), has become a list of strings. And pos_tag(sentence) would return a list of tuples of strings, where the first item in each tuple is the token and the second the POS tag, i.e.

sentence = "I love running angrily"
sentence = word_tokenize(sentence)
sentence = [i.strip(" ") for i in sentence]
thing = pos_tag(sentence)

print(type(thing))
print(thing)

[out]:

<class 'list'>
[('I', 'PRP'), ('love', 'VBP'), ('running', 'VBG'), ('angrily', 'RB')]

Now that we know thing = pos_tag(word_tokenize("I love running angrily")) outputs the above, let's work with that to see what [0][1][0] is accessing.

>>> thing = [('I', 'PRP'), ('love', 'VBP'), ('running', 'VBG'), ('angrily', 'RB')]
>>> thing[0]
('I', 'PRP')

So thing[0] outputs the tuple of (token, pos) for the first token.

>>> thing = [('I', 'PRP'), ('love', 'VBP'), ('running', 'VBG'), ('angrily', 'RB')]
>>> thing[0][1]
'PRP'

And thing[0][1] outputs the POS for the first token.

>>> thing = [('I', 'PRP'), ('love', 'VBP'), ('running', 'VBG'), ('angrily', 'RB')]
>>> thing[0][1][0]
'P'

Ah, so [0][1][0] looks at the first character of the POS of the first token.

So the question is: is that the desired behavior? If so, why are you only looking at the POS of the first word?
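
As an aside, if you do want the first character of the tag for every word, unpacking each (token, pos) tuple is far less error-prone than nested indices; a small sketch reusing thing from above:

for token, pos in thing:
    print(token, pos[0].lower())

[out]:

I p
love v
running v
angrily r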


Regardless of what I'm looking at, your explanation still doesn't tell me why the TypeError: unhashable type: 'list' occurs. Stop distracting me and tell me how to resolve the TypeError!!

Okay, we move on, now that we know thing = pos_tag(word_tokenize("I love running angrily")) and that thing[0][1][0].lower() == 'p'.

Given your if-else conditions,

if pos_label in ["a", "n", "v"]:
    print(lem.lemmatize(word, pos = pos_label))
elif pos_label in ['r']: 
    print(wordnet.synset(sentence+".r.1").lemmas()[0].pertainyms()[0].name())
else:
    print(lem.lemmatize(sentence))

we find that the 'p' value would have gone to the else branch, i.e. print(lem.lemmatize(sentence)). But wait a minute, remember what sentence has become after you modified it:

>>> sentence = word_tokenize("I love running angrily")
>>> sentence = [i.strip(" ") for i in sentence]
>>> sentence 
['I', 'love', 'running', 'angrily']

So what happens if we just ignore all the rest of the code and focus on this:

from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
sentence = ['I', 'love', 'running', 'angrily']

lem.lemmatize(sentence)

[out]:

-------------------------------------------------------------------------
TypeError                               Traceback (most recent call last)
<ipython-input-34-497ae98ecaa3> in <module>
      4 sentence = ['I', 'love', 'running', 'angrily']
      5 
----> 6 lem.lemmatize(sentence)

~/Library/Python/3.6/lib/python/site-packages/nltk/stem/wordnet.py in lemmatize(self, word, pos)
     39 
     40     def lemmatize(self, word, pos=NOUN):
---> 41         lemmas = wordnet._morphy(word, pos)
     42         return min(lemmas, key=len) if lemmas else word
     43 

~/Library/Python/3.6/lib/python/site-packages/nltk/corpus/reader/wordnet.py in _morphy(self, form, pos, check_exceptions)
   1903         # 0. Check the exception lists
   1904         if check_exceptions:
-> 1905             if form in exceptions:
   1906                 return filter_forms([form] + exceptions[form])
   1907 

TypeError: unhashable type: 'list'

Ah ha!! That's where the error occurs!!!

It's because WordNetLemmatizer.lemmatize() is expecting a single string input and you're passing in a list of strings. Example usage:

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
token = 'words'
wnl.lemmatize(token, pos='n')
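
And if you do need to lemmatize a whole tokenized sentence without POS information, a minimal sketch is to lemmatize each token individually; note that lemmatize() defaults to pos='n' (see the pos=NOUN default in the traceback above), so many tokens come back unchanged:

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
tokens = word_tokenize("I love running angrily")
# lemmatize() expects one string at a time, so map it over the tokens.
print([wnl.lemmatize(token) for token in tokens])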

Q: Why didn't you just get to the point?!

A: Then you would miss out on how to debug your code and make it better =)

Upvotes: 1
