Reputation: 9844
I'm currently working with a data set containing raw text which I should pre-process:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
lemma = WordNetLemmatizer()
from autocorrect import spell
for df in [train_df, test_df]:
    df['comment_text'] = df['comment_text'].apply(lambda x: word_tokenize(str(x)))
    df['comment_text'] = df['comment_text'].apply(lambda x: [lemma.lemmatize(spell(word)) for word in x])
    df['comment_text'] = df['comment_text'].apply(lambda x: ' '.join(x))
Including the spell
function, however, raises the memory usage to the point that I get a "MemoryError". This doesn't happen without that function. I'm wondering if there is a way to optimize this process while keeping the spell
function (the data set has lots of misspelled words).
Upvotes: 0
Views: 2700
Reputation: 917
Anyway, I would work with dask: you can divide your dataframe into chunks (partitions) and then retrieve and process each part on its own.
https://dask.pydata.org/en/latest/dataframe.html
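A rough sketch of what that could look like (untested against your data: it reuses word_tokenize, lemma and spell from your question, and npartitions=8 is just a placeholder you would tune to your memory budget):
import dask.dataframe as dd

def preprocess(text):
    # Same steps as in your loop: tokenize, spell-correct, lemmatize, re-join
    tokens = word_tokenize(str(text))
    return ' '.join(lemma.lemmatize(spell(word)) for word in tokens)

# Split the pandas DataFrame into partitions so the expensive spell-checking
# runs one partition at a time instead of over the whole column in one go
ddf = dd.from_pandas(train_df, npartitions=8)
ddf['comment_text'] = ddf['comment_text'].map(preprocess, meta=('comment_text', 'object'))
train_df = ddf.compute()
You would then repeat the same conversion for test_df.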
Upvotes: 1
Reputation: 2162
I haven't got access to your dataframe, so this is a bit speculative, but here goes...
DataFrame.apply
will run the lambda
function on the whole column at once, so it is probably holding all of the intermediate results in memory. Instead, you could turn the lambda into a pre-defined function and use DataFrame.map
, which applies the function element-wise:
def spellcheck_string(input_str):
    # Spell-correct each token, then lemmatize it
    return [lemma.lemmatize(spell(word)) for word in input_str]
for df in [train_df, test_df]:
    # ...
    df['comment_text'] = df['comment_text'].map(spellcheck_string)
    # ...
Could you give this a try and see if it helps?
Upvotes: 2