Rex5

Reputation: 767

Python: Is there a faster way than using autocorrect for spell correction?

I am doing sentiment analysis and have train and test CSV files. After reading them in, I have a train DataFrame with columns text and sentiment.

Tried in google-colab:

!pip install autocorrect
from autocorrect import spell 
train['text'] = [' '.join([spell(i) for i in x.split()]) for x in train['text']]

But it's taking forever to finish. Is there a better way to spell-correct a pandas column? How would I do it?

P.S.: the dataset is fairly large, around 5,000 rows, and each train['text'] value has around 300 words and is of type str. I have not broken train['text'] into sentences.

Upvotes: 1

Views: 1668

Answers (1)

Brad Solomon

Reputation: 40878

First, some sample data:

from typing import List
from autocorrect import spell
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

data_train: List[str] = fetch_20newsgroups(
    subset='train',
    categories=['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space'],
    shuffle=True,
    random_state=444
).data

df = pd.DataFrame({"train": data_train})

Corpus size:

>>> df.shape
(2034, 1)

Mean length of document in characters:

>>> df["train"].str.len().mean()
1956.4896755162242

First observation: spell() (I've never used autocorrect) is really slow. It takes 7.77 seconds on just one document!

>>> first_doc = df.iat[0, 0]
>>> len(first_doc.split())
547
>>> first_doc[:100]
'From: [email protected] (David B. Mckissock)\nSubject: Gibbons Outlines SSF Redesign Guida'
>>> %time " ".join((spell(i) for i in first_doc.split()))
CPU times: user 7.77 s, sys: 159 ms, total: 7.93 s
Wall time: 7.93 s

So that function, not the choice between a vectorized Pandas method and .apply(), is probably your bottleneck. A back-of-the-envelope calculation, given that this document is roughly one-third the length of the average, puts your total non-parallelized run time at 7.93 * 3 * 2034 == 48,388 seconds, or around 13 hours. Not pretty.
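One cheap win before reaching for parallelism: real-world text repeats tokens heavily, and spell correction of a given token should be deterministic, so memoizing the per-word call avoids re-correcting duplicates. A sketch with functools.lru_cache, using a stand-in fix() function (just punctuation stripping) so it runs without autocorrect installed; in practice you would replace fix() with spell():

```python
from functools import lru_cache

def fix(word: str) -> str:
    # Stand-in for autocorrect.spell so this sketch is self-contained;
    # swap in the real spell() in practice.
    return word.strip(".,!?")

@lru_cache(maxsize=None)
def cached_fix(word: str) -> str:
    # Results are cached, so each distinct token is corrected only once.
    return fix(word)

def correct_doc_cached(doc: str) -> str:
    return " ".join(cached_fix(w) for w in doc.split())
```

With ~300-word documents drawn from a shared vocabulary, the cache hit rate climbs quickly across the corpus, so most spell() calls never happen.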

To that end, consider parallelization. This is a highly parallelizable task: applying a CPU-bound, simple callable across a collection of documents. concurrent.futures has an easy API for this. At this point you can take the data structure out of Pandas and into something lightweight, such as a list or tuple.

Example:

>>> corpus = df["train"].tolist()  # or just data_train from above...
>>> import concurrent.futures
>>> import os
>>> os.cpu_count()
24
>>> def correct_doc(doc):
...     return " ".join(spell(i) for i in doc.split())
...
>>> with concurrent.futures.ProcessPoolExecutor() as executor:
...     corrected = list(executor.map(correct_doc, corpus))

Upvotes: 3
