Detecting mistakes in words and fix them when classifying text (NLP)

Question

Hi there✌🏼I make a neural network that classifies the text. First I need to prepare the text and I ran into the problem of "mistakes in words". How can they be found and corrected? And what ideas do you have? Thanks in advance!

Sebastian Dziadzio · Accepted Answer

You can correct spelling errors by maintaining a vocabulary and finding the closest valid word using a string metric like the Levenshtein distance. There are also some more advanced Python tools, like SpaCy Hunspell. That being said, if you plan to use pre-trained word embeddings I wouldn't worry too much about text normalisation, as the embeddings will likely contain most common spelling variants. You can check how many out-of-vocabulary words you have in your data to see if it's worth investing time in extra cleaning except for basic tokenisation (and converting everything to lowercase).

Detecting mistakes in words and fix them when classifying text (NLP)

Answers (1)

Related Questions