Reputation: 9844
I'm currently working with a data set containing raw text which I should pre-process:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
lemma = WordNetLemmatizer()
from autocorrect import spell
for df in [train_df, test_df]:
    df['comment_text'] = df['comment_text'].apply(lambda x: word_tokenize(str(x)))
    df['comment_text'] = df['comment_text'].apply(lambda x: [lemma.lemmatize(spell(word)) for word in x])
    df['comment_text'] = df['comment_text'].apply(lambda x: ' '.join(x))
Including the spell
function, however, raises the memory usage to the point that I get a "MemoryError". This doesn't happen without that function. I'm wondering if there is a way to optimize this process while keeping the spell
function (the data set has lots of misspelled words).
Upvotes: 0
Views: 2700
Reputation: 917
Anyway, I would work with dask: you can divide your dataframe into chunks (partitions) and then retrieve and process each part on its own.
https://dask.pydata.org/en/latest/dataframe.html
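A rough sketch of what that could look like (untested against your data: it reuses word_tokenize, lemma and spell from your question, and npartitions=8 is just a placeholder you would tune to your memory budget):
import dask.dataframe as dd

def preprocess(text):
    # Same steps as in your loop: tokenize, spell-correct, lemmatize, re-join
    tokens = word_tokenize(str(text))
    return ' '.join(lemma.lemmatize(spell(word)) for word in tokens)

# Split the pandas DataFrame into partitions so the expensive spell-checking
# runs one partition at a time instead of over the whole column in one go
ddf = dd.from_pandas(train_df, npartitions=8)
ddf['comment_text'] = ddf['comment_text'].map(preprocess, meta=('comment_text', 'object'))
train_df = ddf.compute()
You would then repeat the same conversion for test_df.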
Upvotes: 1
Reputation: 2162
I haven't got access to your dataframe, so this is a bit speculative, but here goes...
DataFrame.apply
will run the lambda
function on the whole column at once, so it is probably holding all of the intermediate results in memory. Instead, you could turn the lambda into a pre-defined function and use DataFrame.map
, which applies the function element-wise:
def spellcheck_string(input_str):
    # Spell-correct each token, then lemmatize it
    return [lemma.lemmatize(spell(word)) for word in input_str]
for df in [train_df, test_df]:
    # ...
    df['comment_text'] = df['comment_text'].map(spellcheck_string)
    # ...
Could you give this a try and see if it helps?
Upvotes: 2