pceccon

Reputation: 9844

Optimizing memory usage - Pandas/Python

I'm currently working with a data set containing raw text which I should pre-process:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
lemma = WordNetLemmatizer()

from autocorrect import spell

for df in [train_df, test_df]:
    df['comment_text'] = df['comment_text'].apply(lambda x: word_tokenize(str(x)))
    df['comment_text'] = df['comment_text'].apply(lambda x: [lemma.lemmatize(spell(word)) for word in x])
    df['comment_text'] = df['comment_text'].apply(lambda x: ' '.join(x))

Including the spell function, however, raises memory usage to the point where I get a "Memory error". This doesn't happen without it. Is there a way to optimize this process while keeping the spell function? (The data set has lots of misspelled words.)


Upvotes: 0

Views: 2700

Answers (2)

Julio CamPlaz

Reputation: 917

I would work with dask: it lets you divide your dataframe into chunks (partitions), so you can retrieve each part and process it separately.

https://dask.pydata.org/en/latest/dataframe.html
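The divide-and-process idea can be illustrated with a minimal sketch. Note that this uses plain pandas slicing rather than dask, and that `correct_text`, `process_in_chunks`, and the tiny `fixes` dictionary are hypothetical stand-ins for the real tokenize/spell/lemmatize pipeline, not part of any library. Whether chunking actually lowers peak memory depends on where the memory is going.

```python
import pandas as pd

def correct_text(text):
    # Hypothetical stand-in for word_tokenize + spell + lemmatize.
    fixes = {"teh": "the", "recieve": "receive"}
    return " ".join(fixes.get(word, word) for word in str(text).split())

def process_in_chunks(df, column, func, chunk_size=2):
    # Process one slice of rows at a time, so only that slice's
    # intermediate objects are alive while it is being transformed.
    # Assumes df has at least one row (pd.concat rejects an empty list).
    parts = []
    for start in range(0, len(df), chunk_size):
        chunk = df[column].iloc[start:start + chunk_size]
        parts.append(chunk.map(func))
    return pd.concat(parts)

train_df = pd.DataFrame(
    {"comment_text": ["teh cat", "I recieve mail", "fine", "teh end"]}
)
train_df["comment_text"] = process_in_chunks(
    train_df, "comment_text", correct_text
)
```

With dask, the partitions would play the role of the slices here, and each partition could also be written to disk before the next one is processed.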

Upvotes: 1

Phil Sheard

Reputation: 2162

I haven't got access to your dataframe so this is a bit speculative, but here goes...

Calling apply with the lambda runs it over the whole column at once, so it is probably holding the intermediate results in memory. Instead, you could turn the lambda into a named function and use Series.map, which applies the function element by element.

def spellcheck_string(tokens):
    # tokens is the list of words produced by the word_tokenize step
    return [lemma.lemmatize(spell(word)) for word in tokens]

for df in [train_df, test_df]:
    # ...
    df['comment_text'] = df['comment_text'].map(spellcheck_string)
    # ...

Could you give this a try and see if it helps?
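For a version you can run without the data set, here is a rough sketch that folds tokenizing, correcting, and joining into one named function, so the column never has to hold the intermediate token lists. The `spell` and `lemmatize` functions below are hypothetical stand-ins for autocorrect's spell and lemma.lemmatize, and `str.split` stands in for word_tokenize; swap the real ones back in.

```python
import pandas as pd

def spell(word):
    # Hypothetical stand-in for autocorrect's spell().
    return {"teh": "the", "wrld": "world"}.get(word, word)

def lemmatize(word):
    # Hypothetical stand-in for lemma.lemmatize(): crude plural stripping.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def clean_comment(text):
    # One pass per comment: tokenize, correct, lemmatize, re-join.
    tokens = str(text).split()  # stand-in for word_tokenize
    return " ".join(lemmatize(spell(tok)) for tok in tokens)

df = pd.DataFrame({"comment_text": ["teh cats", "hello wrld"]})
df["comment_text"] = df["comment_text"].map(clean_comment)
```

Each comment goes from raw string to cleaned string in a single call, instead of three separate passes over the column.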

Upvotes: 2
