Gergely

Reputation: 61

NLTK lemmatizers do not recognize the plural form of chemical names

I must admit I'm a total beginner in NLP and know next to nothing about NLTK; I'm just trying to use legacy code left by the previous developer. I need to lemmatize words, mostly from chemical and biotech publications. I generally use WordNetLemmatizer, and most of the time it works.

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('cats')

returns 'cat'.

But then I try

lemmatizer.lemmatize('dehydrogenases')

it returns 'dehydrogenases'. I want it to return 'dehydrogenase'. How can I do that?

Upvotes: 0

Views: 289

Answers (1)

Rahul P

Reputation: 2663

Explanation

If you install NLTK as a module and then use the following code to initialize the WordNetLemmatizer:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

you are likely to get a LookupError if the WordNet data has not been downloaded yet:

LookupError: 
**********************************************************************
  Resource wordnet not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('wordnet')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/wordnet.zip/wordnet/

Reason

The lemmatizer that you initialized is based on WordNet. Quoting the WordNet documentation:

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.

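Basically, WordNet does not contain every English word. WordNetLemmatizer only returns a lemma that it can find in the WordNet database; when it finds none, it returns the input unchanged. So, while it works for the word cats, it may not work for specialized terms, such as dehydrogenases, that aren't in WordNet's lexical database.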
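One possible workaround, as a minimal sketch: call the lemmatizer first and, only when it hands the word back unchanged, strip a trailing "s" from words that look like chemical plurals. The function name lemmatize_with_fallback and the suffix list CHEMICAL_PLURAL_SUFFIXES are hypothetical names made up for this illustration; they are not part of NLTK, and the suffix list is an assumption that you would need to tune for your corpus.

```python
# Hypothetical list of suffixes common in plural chemical and enzyme names.
# This is an assumption for illustration, not an exhaustive rule set.
CHEMICAL_PLURAL_SUFFIXES = ("ases", "ides", "ates", "ines", "oses", "ones")


def lemmatize_with_fallback(word, lemmatize):
    """Try the given lemmatizer first, then fall back to dropping a plural 's'."""
    lemma = lemmatize(word)
    if lemma != word:
        # The lemmatizer recognized the word, so trust its answer.
        return lemma
    if word.lower().endswith(CHEMICAL_PLURAL_SUFFIXES):
        # Unknown to the lemmatizer but shaped like a chemical plural: drop the 's'.
        return word[:-1]
    # Nothing matched, so return the word as it came in.
    return word


if __name__ == "__main__":
    try:
        from nltk.stem import WordNetLemmatizer

        wordnet_lemmatize = WordNetLemmatizer().lemmatize
        print(lemmatize_with_fallback("dehydrogenases", wordnet_lemmatize))
    except (ImportError, LookupError):
        # NLTK is not installed, or nltk.download('wordnet') has not been run.
        print("NLTK or its WordNet data is not available")
```

Because the fallback only fires when the lemmatizer leaves the word untouched, words that WordNet already knows keep their WordNet lemma. The suffix rule is crude and will mis-handle some words (a proper name ending in "ates", for example), so treat it as a starting point and check it against a sample of your own vocabulary.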

I hope this helps.

Upvotes: 2
