SteP

Reputation: 69

How to speed up computation time for stopword removal and lemmatization in NLP

As part of pre-processing for a text classification model, I have added stopword removal and lemmatization steps, using the NLTK library. The code is below:

import re
import pandas as pd
import nltk; nltk.download("all")
from nltk.corpus import stopwords; stop = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Stopwords removal


def remove_stopwords(entry):
  sentence_list = [word for word in entry.split() if word not in stopwords.words("english")]
  return " ".join(sentence_list)

df["Description_no_stopwords"] = df.loc[:, "Description"].apply(lambda x: remove_stopwords(x))

# Lemmatization

lemmatizer = WordNetLemmatizer()

def punct_strip(string):
  s = re.sub(r'[^\w\s]',' ',string)
  return s

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

def lemmatize_rows(entry):
  sentence_list = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in punct_strip(entry).split()]
  return " ".join(sentence_list)

df["Description - lemmatized"] = df.loc[:, "Description_no_stopwords"].apply(lambda x: lemmatize_rows(x))

The problem is that, when I pre-process a dataset with 27k entries (my test set), it takes 40-45 seconds for stopword removal and just as long for lemmatization. By contrast, model evaluation only takes 2-3 seconds.

How can I re-write the functions to optimise computation speed? I have read something about vectorization, but the example functions were much simpler than the ones shown above, and I wouldn't know how to apply it in this case.

Upvotes: 0

Views: 911

Answers (2)

Akarshan B20ES014

Reputation: 1

import time
from nltk.corpus import stopwords

# Take 1: rebuilds the stopword list for every single word
def remove_stopwords1(text):
    new_text = []
    for word in text.split():
        # stopwords.words('english') is re-created on each iteration
        if word in stopwords.words('english'):
            new_text.append('')
        else:
            new_text.append(word)
    return " ".join(new_text)


# Take 2: builds the stopword list once per call and reuses it
def remove_stopwords2(text):
    new_text = []
    stopword_list = stopwords.words('english')
    for word in text.split():
        if word in stopword_list:
            new_text.append('')
        else:
            new_text.append(word)
    return " ".join(new_text)


# Time take 1 on a single review and extrapolate to 50,000 rows
start = time.time()
remove_stopwords1(df['review'][0])
time2 = time.time() - start
print(time2 * 50000)

# Time take 2 over the whole column
start = time.time()
df['review'] = df['review'].apply(remove_stopwords2)
time2 = time.time() - start
print(time2)

Time taken by take 1: 7k+ seconds (extrapolated)
Time taken by take 2: 148 seconds
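
For what it's worth, the same caching idea can be taken one step further by building the stopword collection once at module level and storing it as a set (as the question's own stop = set(...) line already does), so each membership check is O(1) instead of a scan over a list. A rough sketch of such a hypothetical "take 3", reusing the df['review'] column from the timings above:

from nltk.corpus import stopwords

# The stopword set is built once, outside the per-row function
stopword_set = set(stopwords.words('english'))

def remove_stopwords3(text):
    # Set lookups are O(1), so each word check is cheap
    return " ".join(word for word in text.split() if word not in stopword_set)

df['review'] = df['review'].apply(remove_stopwords3)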

Upvotes: 0

AloneTogether

Reputation: 26708

A similar question was asked here, and the suggestion there is to cache the stopwords.words("english") object: in your remove_stopwords method you create that object every time you evaluate an entry, so you can definitely improve that. Regarding your lemmatizer, as mentioned here, you can also cache your results to improve performance. I can imagine that your pandas operations are also quite expensive. You may consider converting your dataframe into an array or dictionary and then iterating over it; if you need a dataframe later, you can easily convert it back.
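
Putting those suggestions together, a minimal sketch could look like the one below. The column names are taken from your question, df is assumed to be your dataframe, and folding stopword removal and lemmatization into a single pass is just one possible arrangement:

import re
from functools import lru_cache

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# Build the stopword set and the lemmatizer once, outside the per-row function
stop = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

@lru_cache(maxsize=None)
def lemmatize_word(word):
    # POS-tag and lemmatize each distinct word at most once; repeats hit the cache
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN,
                "V": wordnet.VERB, "R": wordnet.ADV}
    return lemmatizer.lemmatize(word, tag_dict.get(tag, wordnet.NOUN))

def preprocess(entry):
    words = re.sub(r"[^\w\s]", " ", entry).split()
    return " ".join(lemmatize_word(w) for w in words if w not in stop)

# Iterate over a plain list instead of chaining several DataFrame .apply calls
df["Description - lemmatized"] = [preprocess(x) for x in df["Description"].tolist()]

The lru_cache is where most of the saving should come from: pos_tag and lemmatize run once per distinct word rather than once per token, which matters a lot in natural-language text where a small vocabulary covers most tokens.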

Upvotes: 1
