Matt M.

Reputation: 539

NLTK WordNet Lemmatizer - How to remove the unknown words?

I'm trying to use the NLTK WordNet Lemmatizer on tweets.

I would like to remove all words that are not found in WordNet (Twitter handles and such), but WordNetLemmatizer.lemmatize() gives no feedback. It simply returns the word unchanged if it can't find it.

Is there a way to check if a word is found in WordNet or not?

Alternatively, is there a better way to remove anything but "proper English words" from a string?

Upvotes: 2

Views: 2600

Answers (1)

Bob Dylan

Reputation: 1833

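You can check with wordnet.synsets(token), which returns an empty list when WordNet has no entry for the token. Be sure to deal with punctuation as well, then keep only the tokens for which the returned list is non-empty. Here's an example: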

from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import wordnet

my_list_of_strings = []  # populate list before using

wpt = WordPunctTokenizer()
only_recognized_words = []

for s in my_list_of_strings:
    tokens = wpt.tokenize(s)
    if tokens:  # skip strings that produced no tokens
        for t in tokens:
            if wordnet.synsets(t):
                only_recognized_words.append(t)  # only keep recognized words

But you should really write some custom logic for Twitter data, particularly for handling hashtags, @replies, usernames, links, retweets, etc. There are plenty of papers with strategies to glean from.
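As a rough starting point for that kind of custom logic, here is a minimal sketch that strips the Twitter-specific pieces with regular expressions before the text goes to the tokenizer. The patterns and the clean_tweet name are my own assumptions, not anything from NLTK, and they are deliberately simple: they will also mangle e-mail addresses, for instance, and real tweets will need more cases.

```python
import re

# Illustrative patterns only (assumptions, not an NLTK API).
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
MENTION_RE = re.compile(r"@\w+:?")              # @handles, with an optional trailing colon
HASHTAG_RE = re.compile(r"#(\w+)")              # hashtags; the word itself is captured
RETWEET_RE = re.compile(r"^RT\b:?\s*")          # a leading "RT" marker


def clean_tweet(text):
    """Strip Twitter-specific tokens so that only ordinary words remain."""
    text = RETWEET_RE.sub("", text)      # drop the retweet marker
    text = URL_RE.sub(" ", text)         # drop links
    text = MENTION_RE.sub(" ", text)     # drop @handles
    text = HASHTAG_RE.sub(r"\1", text)   # keep the hashtag word, drop the '#'
    return " ".join(text.split())        # collapse leftover whitespace


print(clean_tweet("RT @bob: loving the new #python release https://t.co/abc123"))
# → loving the new python release
```

The cleaned string can then be tokenized and filtered with wordnet.synsets() as in the loop above. Keeping the hashtag word rather than dropping it is a judgment call; drop it instead if hashtags are mostly noise in your data.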

Upvotes: 4
