Reputation: 841
I am trying to speed up a spelling check on a large dataset with 147k rows. The following function has been running for an entire afternoon and is still not finished. Is there a way to speed up the spelling check? The messages have already been case-normalized, had punctuation removed, and been lemmatized, and they are all strings.
import autocorrect
from autocorrect import Speller
spell = Speller()
def spell_check(x):
    correct_word = []
    mispelled_word = x.split()
    for word in mispelled_word:
        correct_word.append(spell(word))
    return ' '.join(correct_word)
df['clean'] = df['old'].apply(spell_check)
Upvotes: 0
Views: 1171
Reputation: 3174
In addition to what @Amadan said, which is definitely true (autocorrect does the correction in a very inefficient way): you treat each word in the giant dataset as if it were being looked up for the first time, because you call spell() on every occurrence. In reality (at least after a while) almost all words have already been looked up, so storing those results and reusing them would be much more efficient.
Here is one way to do it:
import autocorrect
from autocorrect import Speller
spell = Speller()
# get all unique words in the data as a set (first split each row into words, then put them all in a flat set)
unique_words = {word for words in df["old"].apply(str.split) for word in words}
# get the corrected version of each unique word and put this mapping in a dictionary
corrected_words = {word: spell(word) for word in unique_words}
# write the cleaned row by looking up the corrected version of each unique word
df['clean'] = [" ".join([corrected_words[word] for word in row.split()]) for row in df["old"]]
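If you would rather keep the row-wise apply, the same memoization idea can also be sketched with functools.lru_cache wrapping the speller call (cached_spell below is just an illustrative name, not part of autocorrect):
from functools import lru_cache
from autocorrect import Speller

spell = Speller()

@lru_cache(maxsize=None)
def cached_spell(word):
    # each distinct word is corrected only once; repeated words come from the cache
    return spell(word)

df['clean'] = df['old'].apply(lambda row: ' '.join(cached_spell(w) for w in row.split()))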
Upvotes: 2
Reputation: 198334
The autocorrect library is not very efficient, and it is not made for tasks like the one you present. What it does is generate all the possible candidates with one or two typos and check which of them are valid words, and it does this in plain Python.
Take a six-letter word like "source":
from autocorrect.typos import Word
print(sum(1 for c in Word('source').typos()))
# => 349
print(sum(1 for c in Word('source').double_typos()))
# => 131305
autocorrect generates as many as 131654 candidates to test, just for this word. What if the word is longer? Let's try "transcompilation":
print(sum(1 for c in Word('transcompilation').typos()))
# => 889
print(sum(1 for c in Word('transcompilation').double_typos()))
# => 813325
That's 814214 candidates, just for one word! And note that numpy can't speed this up, as the values are Python strings and you are invoking a Python function on every row. The only way to speed this up is to change the method you are using for spell-checking, for example by using the aspell-python-py3 library instead (a wrapper for aspell, AFAIK the best free spellchecker for Unix).
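As a rough sketch of what that could look like, assuming the aspell-python-py3 bindings (imported as aspell) with their Speller, check() and suggest() calls; whether it is actually faster on your data is something you would have to measure:
import aspell

speller = aspell.Speller('lang', 'en')

def spell_check(text):
    corrected = []
    for word in text.split():
        if speller.check(word):                  # already a valid word, keep it as-is
            corrected.append(word)
        else:
            suggestions = speller.suggest(word)  # take the top suggestion, if there is one
            corrected.append(suggestions[0] if suggestions else word)
    return ' '.join(corrected)

df['clean'] = df['old'].apply(spell_check)
This also combines well with the unique-word caching from the other answer, since aspell is still called once per word.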
Upvotes: 3