lydias

Reputation: 841

NLP: how to speed up spelling correction on 147k rows of short messages

Trying to speed up the spelling check on a large dataset with 147k rows. The following function has been running for an entire afternoon and is still running. Is there a way to speed up the spelling check? The messages have already been case-normalized, had punctuation removed, and been lemmatized, and they are all strings.

import autocorrect
from autocorrect import Speller
spell = Speller()

def spell_check(x):
    # run every word in the message through the speller and rejoin into a string
    corrected_words = []
    misspelled_words = x.split()
    for word in misspelled_words:
        corrected_words.append(spell(word))
    return ' '.join(corrected_words)

df['clean'] = df['old'].apply(spell_check)

Upvotes: 0

Views: 1171

Answers (2)

EliasK93

Reputation: 3174

In addition to what @Amadan said, which is definitely true (autocorrect does the correction in a very inefficient way):

You treat every word in the giant dataset as if it were being looked up for the first time, because you call spell() on each occurrence. In reality (at least after a while) almost every word has already been looked up before, so storing those results and reusing them is much more efficient.

Here is one way to do it:

import autocorrect
from autocorrect import Speller
spell = Speller()

# get all unique words in the data as a set (first split each row into words, then put them all in a flat set)
unique_words = {word for words in df["old"].apply(str.split) for word in words}

# get the corrected version of each unique word and put this mapping in a dictionary
corrected_words = {word: spell(word) for word in unique_words}

# write the cleaned row by looking up the corrected version of each unique word
df['clean'] = [" ".join([corrected_words[word] for word in row.split()]) for row in df["old"]]

Upvotes: 2

Amadan

Reputation: 198334

The autocorrect library is not very efficient, and it is not made for tasks like the one you present. What it does is generate all possible candidates with one or two typos and check which of them are valid words, and it does this in plain Python.

Take a six-letter word like "source":

from autocorrect.typos import Word
print(sum(1 for c in Word('source').typos()))
# => 349
print(sum(1 for c in Word('source').double_typos()))
# => 131305

autocorrect generates as many as 131654 candidates to test, just for this word. What if it is longer? Let's try "transcompilation":

print(sum(1 for c in Word('transcompilation').typos()))
# => 889
print(sum(1 for c in Word('transcompilation').double_typos()))
# => 813325

That's 814214 candidates, just for one word! And note that numpy can't speed this up, as the values are Python strings and you're invoking a Python function on every row. The only way to speed this up is to change the method you are using for spell-checking: for example, by using the aspell-python-py3 library instead (a wrapper around aspell, AFAIK the best free spellchecker for Unix).
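
As a rough illustration (not part of the original answer), here is a sketch of how that could be combined with the per-unique-word caching from the other answer, assuming the aspell-python-py3 API of aspell.Speller('lang', 'en') with check() and suggest():

import aspell

speller = aspell.Speller('lang', 'en')

def correct_word(word):
    # keep the word if aspell accepts it, otherwise take its top suggestion
    if speller.check(word):
        return word
    suggestions = speller.suggest(word)
    return suggestions[0] if suggestions else word

# correct each unique word once, then map the results back onto the rows
unique_words = {word for row in df['old'] for word in row.split()}
corrected = {word: correct_word(word) for word in unique_words}
df['clean'] = [' '.join(corrected[word] for word in row.split()) for row in df['old']]

aspell does its candidate search in native code, so each lookup is far cheaper than autocorrect's pure-Python approach.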

Upvotes: 3
