Reputation: 61
I must admit I'm new to NLP and have no experience with NLTK; I'm just trying to use legacy code left by the previous developer. I need to lemmatize words, mostly from chemical and biotech publications. I generally use WordNetLemmatizer, and most of the time it works.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('cats')
returns cat.
But then I try
lemmatizer.lemmatize('dehydrogenases')
it returns 'dehydrogenases'. I want it to return 'dehydrogenase'. How can I do that?
Upvotes: 0
Views: 289
Reputation: 2663
Explanation
If you install nltk as a module and then use the following code to initialize the WordNetLemmatizer:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
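you are likely to get a LookupError the first time you call lemmatizer.lemmatize(), unless the WordNet data has already been downloaded. The error says: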
LookupError:
**********************************************************************
Resource wordnet not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('wordnet')
For more information see: https://www.nltk.org/data.html
Attempted to load corpora/wordnet.zip/wordnet/
Reason
The lemmatizer that you initialized is based on WordNet. Quoting the documentation of WordNet:
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
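Basically, WordNet does not contain every English word, and specialized chemical or biotech terms are often missing. When the lemmatizer cannot find a word in WordNet, it returns the input unchanged. So, while it works for the word cats, it will not work for words such as dehydrogenases that aren't in WordNet's lexical database.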
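One possible workaround is to keep WordNetLemmatizer as the first choice and fall back to a crude plural-stripping rule only for words WordNet does not recognize. This is a rough sketch under assumptions, not something NLTK provides: strip_plural, lemmatize_with_fallback and make_wordnet_helpers are hypothetical helper names, and the suffix rules are a heuristic that will mishandle irregular plurals.

```python
def strip_plural(word):
    """Crude noun de-pluralizer (a heuristic, not a real lemmatizer)."""
    # "antibodies" -> "antibody"
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"
    # "dehydrogenases" -> "dehydrogenase"; leave "-ss", "-us", "-is" endings alone
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def lemmatize_with_fallback(word, lemmatize, is_known):
    """Use `lemmatize` first; apply the heuristic only to unknown words.

    `lemmatize` maps a word to its lemma; `is_known` reports whether the
    dictionary recognizes the word. Both are passed in as callables.
    """
    lemma = lemmatize(word)
    # Trust the dictionary if it changed the word or already knows it
    # (e.g. "species" or "gas" should stay untouched).
    if lemma != word or is_known(word):
        return lemma
    return strip_plural(word)


def make_wordnet_helpers():
    """Build the two callables from NLTK (requires nltk and the wordnet data)."""
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return lemmatizer.lemmatize, lambda w: bool(wordnet.synsets(w))
```

With NLTK and the WordNet data installed, calling lemmatize, is_known = make_wordnet_helpers() and then lemmatize_with_fallback('dehydrogenases', lemmatize, is_known) would return 'dehydrogenase' as long as WordNet does not recognize the word, which the question suggests is the case. For a domain-heavy corpus, a lemmatizer trained on biomedical text would likely be more reliable than this heuristic.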
I hope this helps.
Upvotes: 2