Reputation: 571
How can I verify whether I am correctly using the NLTK lemmatizer in this list comprehension, specifically whether it is taking account of the POS tags?
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# tagger (a POS tagger) and get_wordnet_pos (a Treebank-to-WordNet
# tag converter) are defined elsewhere.

clean_article_string = article_db.loc[0, 'clean_text']  # pandas DataFrame cell containing a string
tokens = word_tokenize(clean_article_string)
treebank_tagged_tokens = tagger.tag(tokens)
wordnet_tagged_tokens = [(w, get_wordnet_pos(t)) for (w, t) in treebank_tagged_tokens]
lemmatized_tokens = [(lemmatizer.lemmatize(w).lower(), t) for (w, t) in wordnet_tagged_tokens]
print(len(set(wordnet_tagged_tokens)), len(set(lemmatized_tokens)))
423 384
I'm using a converter I found on Stack Overflow to switch from Treebank to WordNet tags, and it works fine. My issue is whether, in lemmatized_tokens, the lemmatizer is actually taking both the word and the tag of each (w, t) tuple into account, or whether it is only looking at w and lemmatizing based on that alone (presuming everything to be a noun). I tried...
lemmatized_tokens = [(lemmatizer.lemmatize(w,t)) for (w,t) in wordnet_tagged_tokens]
and
lemmatized_tokens = [(lemmatizer.lemmatize(w, pos=t)) for (w,t) in wordnet_tagged_tokens]
both of which produce a KeyError: ''
in the WordNet lemmatize function. So the initial code does run, but I don't know whether it is using the POS tag or not. Does anyone know whether the lemmatizer takes the tag into account in the working code, and/or how I can verify that it does?
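One direct way to check (a minimal sketch using only WordNetLemmatizer; 'running' is just an illustrative word) is to compare the output of lemmatize() with and without an explicit pos argument, on a word whose lemma depends on its part of speech:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# With no pos argument, lemmatize() defaults to pos='n' (noun),
# so the verb form comes back unchanged:
print(lemmatizer.lemmatize('running'))           # running
print(lemmatizer.lemmatize('running', pos='v'))  # run

If the single-argument call were using the tags, the two outputs would match; since they differ, lemmatizer.lemmatize(w) is ignoring t entirely.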
Upvotes: 2
Views: 956
Reputation: 571
Answer by ewcz in the comments, labelled as community wiki. This helped me and might help others.
If you use lemmatizer.lemmatize(w)
, then it will use the default POS tag 'n'. The error suggests that some of the tags are empty - in this case, perhaps one could fall back to nouns, i.e., use
lemmatizer.lemmatize(w, pos=t if t else 'n')
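For reference, the Treebank-to-WordNet converters commonly posted on Stack Overflow look something like the sketch below (an assumption; the asker's actual converter is not shown). Its empty-string return for unmapped tags is what raises KeyError: '', and the fallback above avoids it:

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags onto the four WordNet POS constants;
    # anything else (e.g. 'DT', 'IN', punctuation) falls through to ''.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

# Fall back to the noun tag whenever the converted tag is empty,
# so lemmatize() never receives pos='':
lemmatized_tokens = [(lemmatizer.lemmatize(w, pos=t if t else 'n').lower(), t)
                     for (w, t) in wordnet_tagged_tokens]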
Upvotes: 1