Frank_O

Reputation: 37

How to do lemmatization using NLTK or pywsd

I know that my explanation is rather long, but I found it necessary. Hopefully someone is patient and a helpful soul :) I'm doing a sentiment analysis project at the moment and I'm stuck in the pre-processing part. I imported the csv file, made it into a dataframe, and transformed the variables/columns into the right data types. Then I did the tokenization like this, where I choose the variable I want to tokenize (tweet content) in the dataframe (df_tweet1):

# Tokenization
tknzr = TweetTokenizer()
tokenized_sents = [tknzr.tokenize(str(i)) for i in df_tweet1['Tweet Content']]
for i in tokenized_sents:
    print(i)

The output is a list of lists with words (tokens).
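For example, with a made-up tweet the tokenizer output looks something like this:

from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
print(tknzr.tokenize("I love this movie! #great :)"))
# ['I', 'love', 'this', 'movie', '!', '#great', ':)']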

Then I perform stop word removal:

# Stop word removal
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
#add words that aren't in the NLTK stopwords list
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

clean_sents = []
for m in tokenized_sents:
    stop_m = [i for i in m if str(i).lower() not in new_stopwords_list]
    clean_sents.append(stop_m)

The output is the same, but without stop words.

The next two steps are confusing to me (part-of-speech tagging and lemmatization). I tried two things:

1) Convert the previous output into a list of strings

new_test = [' '.join(x) for x in clean_sents]

since I thought that would enable me to use this code to do both steps in one:

from pywsd.utils import lemmatize_sentence

text = new_test
lemm_text = lemmatize_sentence(text, keepWordPOS=True)

I got this error: TypeError: expected string or bytes-like object

2) Perform POS tagging and lemmatization separately. First POS tagging, using clean_sents as input:

# PART-OF-SPEECH        
def process_content(clean_sents):
    try:
        tagged_list = []  
        for lst in clean_sents[:500]: 
            for item in lst:
                words = nltk.word_tokenize(item)
                tagged = nltk.pos_tag(words)
                tagged_list.append(tagged)
        return tagged_list

    except Exception as e:
        print(str(e))

output_POS_clean_sents = process_content(clean_sents)

The output is a list of lists of words with a tag attached. Then I want to lemmatize this output, but how? I tried two modules, but both gave me errors:

from pywsd.utils import lemmatize_sentence

lemmatized= [[lemmatize_sentence(output_POS_clean_sents) for word in s]
              for s in output_POS_clean_sents]

# AND

from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in s]
              for s in output_POS_clean_sents]
print(lemmatized)

The errors were respectively:

TypeError: expected string or bytes-like object

AttributeError: 'tuple' object has no attribute 'endswith'

Upvotes: 1

Views: 806

Answers (2)

Edoardo Guerriero

Reputation: 1250

If you're using a dataframe, I suggest you store the results of the pre-processing steps in a new column. That way you can always check the output, and you can always create a list of lists to use as input for a model in one line of code afterwards. Another advantage of this approach is that you can easily visualise the preprocessing pipeline and add other steps wherever you need without getting confused.

Regarding your code, it can be optimised (for example, you could perform stop word removal and tokenisation at the same time), and I see a bit of confusion about the steps you performed. For example, you perform lemmatisation multiple times, using different libraries, and there is no point in doing that. In my opinion nltk works just fine; personally I use other libraries to preprocess tweets only to deal with emojis, urls and hashtags, all stuff specifically related to tweets.

# I won't write all the imports, you get them from your code
# define new column to store the processed tweets
df_tweet1['Tweet Content Clean'] = pd.Series(index=df_tweet1.index, dtype=object)  # object dtype so each cell can hold a list

tknzr = TweetTokenizer()
lmtzr = WordNetLemmatizer()

stop_words = set(stopwords.words("english"))
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

# iterate through each tweet
for ind, row in df_tweet1.iterrows():

    # get initial tweet: ['This is the initial tweet']
    tweet = row['Tweet Content']

    # tokenisation, stopwords removal and lemmatisation all at once
    # out: ['initial', 'tweet']
    tweet = [lmtzr.lemmatize(i) for i in tknzr.tokenize(tweet) if i.lower() not in new_stopwords_list]

    # pos tag, no need to lemmatise again after.
    # out: [('initial', 'JJ'), ('tweet', 'NN')]
    tweet = nltk.pos_tag(tweet)

    # save processed tweet into the new column
    # (use .at for single-cell assignment; .loc can raise an error
    # when the value is a list of tuples)
    df_tweet1.at[ind, 'Tweet Content Clean'] = tweet

So overall all you need is four lines: one to get the tweet string, two to preprocess the text, and another one to store the tweet. You can add extra processing steps, paying attention to the output of each step (for example, tokenisation returns a list of strings, while POS tagging returns a list of tuples, which is why you were getting errors).
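If you do decide to lemmatise after POS tagging instead, here is a minimal sketch (the tagged tuples are made up): map each Penn Treebank tag to a WordNet POS and pass the word string, not the (word, tag) tuple, to the lemmatiser. Passing the tuple itself is exactly what caused your 'tuple' object has no attribute 'endswith' error.

from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

def penn_to_wordnet(tag):
    # WordNet only distinguishes adjective/verb/adverb/noun; noun is the default
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lmtzr = WordNetLemmatizer()
tagged = [('initial', 'JJ'), ('tweets', 'NNS'), ('running', 'VBG')]  # made-up example
print([lmtzr.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged])
# ['initial', 'tweet', 'run']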

If you want, you can then create a list of lists containing all the tweets in the dataframe:

# out: [[('initial', 'JJ'), ('tweet', 'NN')], [second tweet], [third tweet]]
all_tweets = [tweet for tweet in df_tweet1['Tweet Content Clean']]
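Equivalently, df_tweet1['Tweet Content Clean'].tolist() gives the same list of lists in a single call.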

Upvotes: 1

Sebastian Baltser

Reputation: 758

In the first part, new_test is a list of strings. lemmatize_sentence expects a string, so passing new_test will raise an error like the one you got. You would have to pass each string separately and then create a list from the lemmatized strings. So:

text = new_test
lemm_text = [lemmatize_sentence(sentence, keepWordPOS=True) for sentence in text]

should create a list of lemmatized sentences.
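One caveat, based on my reading of pywsd's source (so double-check): with keepWordPOS=True each call returns a (words, lemmas, pos_tags) triple rather than a flat list of lemmas, so if you only need the lemmas you can unpack the triples:

# each element of lemm_text is a (words, lemmas, pos_tags) triple
lemmas_only = [lemmas for words, lemmas, pos_tags in lemm_text]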

I actually once did a project that seems similar to what you are doing. I made the following function to lemmatize strings:

import lemmy, re

def remove_stopwords(lst):
    with open('stopwords.txt', 'r') as sw:
        #read the stopwords file 
        stopwords = sw.read().split('\n')
        return [word for word in lst if not word in stopwords]

def lemmatize_strings(body_text, language = 'da', remove_stopwords_ = True):
    """Function to lemmatize a string or a list of strings, i.e. remove prefixes. Also removes punctuations.

    -- body_text: string or list of strings
    -- language: language of the passed string(s), e.g. 'en', 'da' etc.
    """

    if isinstance(body_text, str):
        body_text = [body_text] #Convert whatever is passed to a list, to support passing a single string

    if not hasattr(body_text, '__iter__'):
        raise TypeError('Passed argument should be a sequence.')

    lemmatizer = lemmy.load(language) #load lemmatizing dictionary

    lemma_list = [] #list to store each lemmatized string 

    word_regex = re.compile('[a-zA-Z0-9æøåÆØÅ]+') #all letters (incl. Danish æøå) and digits, i.e. all possible words

    for string in body_text:
        #remove punctuation and split words
        matches = word_regex.findall(string)

        #lowercase each word unless it is all caps
        lemmatized_string = [word.lower() if not word.isupper() else word for word in matches]

        #remove stopwords before lemmatizing
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        #lemmatize each word and choose the shortest word of suggested lemmatizations
        lemmatized_string = [min(lemmatizer.lemmatize('', word), key=len) for word in lemmatized_string]

        #remove stopwords again, since lemmatizing can map a word onto a stopword
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        lemma_list.append(' '.join(lemmatized_string))

    return lemma_list if len(lemma_list) > 1 else lemma_list[0] #return list if list was passed, else return string
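A quick usage sketch (the Danish sample sentence is made up, and with remove_stopwords_=False the stopwords.txt file isn't needed):

print(lemmatize_strings('kattene løber hurtigt', remove_stopwords_=False))
# output along the lines of: 'kat løbe hurtig'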

You could have a look at that if you want, but don't feel obligated. I would be more than glad if it helps you get some ideas; I spent a lot of time trying to figure it out myself!

Let me know :-)

Upvotes: 1
