Aleksandr Lozko

Reputation: 43

Detecting and fixing misspelled words when classifying text (NLP)

Hi there ✌🏼 I'm building a neural network that classifies text. First I need to prepare the text, and I've run into the problem of misspelled words. How can they be found and corrected? What ideas do you have? Thanks in advance!

Upvotes: 1

Views: 806

Answers (1)

Sebastian Dziadzio

Reputation: 530

You can correct spelling errors by maintaining a vocabulary and replacing each unknown word with the closest valid word under a string metric such as the Levenshtein distance. There are also more advanced Python tools, like spaCy Hunspell. That said, if you plan to use pre-trained word embeddings, I wouldn't worry too much about text normalisation, as the embeddings will likely cover most common spelling variants. Check how many out-of-vocabulary words your data contains to see whether it's worth investing time in cleaning beyond basic tokenisation and lowercasing.
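To make the two suggestions concrete, here is a minimal sketch in plain Python: a Levenshtein distance, a lookup that picks the closest vocabulary word, and an out-of-vocabulary rate. The function names (`levenshtein`, `correct`, `oov_rate`) and the `max_distance` cutoff of 2 are my own assumptions for illustration, not part of any library, and the linear scan over the vocabulary will be slow for large word lists.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def correct(word: str, vocabulary: set[str], max_distance: int = 2) -> str:
    """Return the closest vocabulary word, or the lowercased word if none is near.

    max_distance is an assumed cutoff; tune it for your data.
    """
    word = word.lower()
    if not vocabulary or word in vocabulary:
        return word
    # Ties are broken alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda v: (levenshtein(word, v), v))
    return best if levenshtein(word, best) <= max_distance else word


def oov_rate(tokens: list[str], vocabulary: set[str]) -> float:
    """Fraction of tokens (lowercased) that are not in the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t.lower() not in vocabulary)
    return missing / len(tokens)
```

In practice the vocabulary would be the word list of your pre-trained embeddings. If `oov_rate` on your tokenised corpus is already low, spelling correction is probably not worth the effort; if it is high, run `correct` only on the out-of-vocabulary tokens.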

Upvotes: 2
