James Allen-Robertson

Reputation: 571

NLTK Lemmatizing with list comprehension

How can I verify whether I am correctly using the NLTK lemmatizer in this list comprehension, specifically whether it is taking account of the POS tags?

from nltk import word_tokenize
# tagger and get_wordnet_pos (a Treebank-to-WordNet tag converter) are defined earlier.

clean_article_string = article_db.loc[0, 'clean_text']  # pandas DataFrame cell containing a string
tokens = word_tokenize(clean_article_string)
treebank_tagged_tokens = tagger.tag(tokens)
wordnet_tagged_tokens = [(w, get_wordnet_pos(t)) for (w, t) in treebank_tagged_tokens]
lemmatized_tokens = [(lemmatizer.lemmatize(w).lower(), t) for (w, t) in wordnet_tagged_tokens]
print(len(set(wordnet_tagged_tokens)), len(set(lemmatized_tokens)))
423 384

I'm using a converter I found on Stack Overflow to map Treebank tags to WordNet tags, and it works fine. My issue is whether, for lemmatized_tokens, the lemmatizer is actually taking both the word and the tag of my (w,t) tuple into account, or whether it is just looking at w and lemmatizing based on that alone (presuming everything to be a noun). I tried...

lemmatized_tokens = [(lemmatizer.lemmatize(w,t)) for (w,t) in wordnet_tagged_tokens]

and

lemmatized_tokens = [(lemmatizer.lemmatize(w, pos=t)) for (w,t) in wordnet_tagged_tokens]

which produces a KeyError: '' in the WordNet lemmatize function. So the initial code actually functions, but I don't know if it is using the POS tag or not. Does anyone know whether the lemmatizer takes the tag into account in the working code, and/or how I can verify that it does?

Upvotes: 2

Views: 956

Answers (1)

James Allen-Robertson

Reputation: 571

Answer by ewcz in the comments, labelled as community wiki. This helped me and might help others.


If you use lemmatizer.lemmatize(w), it will use the default POS tag 'n'. The error suggests that some of the tags are empty; in that case, one could fall back to nouns, i.e. use lemmatizer.lemmatize(w, pos=t if t else 'n')

Upvotes: 1
