NLP way to remove typos?

Question

For example, take the sentence "5 average age of card members joined last year". The "5" is obviously a typo, and I would like to normalize the sentence to "average age of card members joined last year" before further processing. What NLP technique can I use for this task?

Jindřich · Accepted Answer

Removing typos to standardize the input is not a common preprocessing step in NLP.

Automatic grammar correction (which includes fixing obvious typos) is a rather complicated task, and solutions that work well are computationally demanding. Currently, the best results are achieved by large deep learning models, some of which you can download from the HuggingFace Model Hub and use directly. As a more lightweight solution, you can try applying a spell-checker or writing some rules that suit your data well.
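To illustrate the rule-based option, here is a minimal sketch for the example in the question. The rule itself is an assumption about the data (a run of one or two digits at the very start of a sentence, followed by a word, is treated as a stray keystroke), and the names `LEADING_STRAY_DIGITS` and `strip_leading_stray_digits` are made up for this example. It would wrongly strip legitimate leading numbers such as "5 members joined", so check it against your own data before relying on it.

```python
import re

# Hypothetical rule: one or two digits at the start of the sentence,
# followed by whitespace and then a letter, are considered a typo.
LEADING_STRAY_DIGITS = re.compile(r"^\s*\d{1,2}\s+(?=[A-Za-z])")


def strip_leading_stray_digits(sentence: str) -> str:
    """Remove a short run of digits at the start of the sentence, if present."""
    return LEADING_STRAY_DIGITS.sub("", sentence, count=1)


if __name__ == "__main__":
    raw = "5 average age of card members joined last year"
    print(strip_leading_stray_digits(raw))
```

Digits elsewhere in the sentence (for example a year at the end) are left untouched, because the pattern is anchored to the start of the string.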

Rather than removing errors in the pre-processing step, the usual approach is to make NLP models and algorithms robust to noise in the input. In simple statistical models, this is usually achieved by considering only words (or word n-grams) that appear at least several times in the training data, so that rare misspellings never enter the vocabulary. Large neural models typically become robust through large-scale pre-training on all available data, typos included.
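The frequency cutoff for simple statistical models can be sketched as follows. This is only an illustration under assumptions of my own: whitespace tokenization, lowercasing, a threshold of `min_count=2`, and an `<unk>` placeholder for everything below the threshold. The function names `build_vocabulary` and `replace_rare` are hypothetical.

```python
from collections import Counter


def build_vocabulary(sentences, min_count=2):
    """Keep only tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return {tok for tok, count in counts.items() if count >= min_count}


def replace_rare(sentence, vocab, unk="<unk>"):
    """Map every token outside the vocabulary to a shared placeholder."""
    return " ".join(tok if tok in vocab else unk for tok in sentence.lower().split())


if __name__ == "__main__":
    corpus = [
        "average age of card members",
        "average age of new members",
        "avergae age of card members",  # contains a misspelling
    ]
    vocab = build_vocabulary(corpus, min_count=2)
    print(replace_rare("avergae age of card members", vocab))
```

The misspelling "avergae" occurs only once, so it falls below the cutoff and is treated like any other unknown word instead of getting its own feature.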
