Reputation: 43
Hi there ✌🏼 I'm building a neural network that classifies text. First I need to preprocess the text, and I've run into the problem of misspelled words. How can they be found and corrected? What ideas do you have? Thanks in advance!
Upvotes: 1
Views: 806
Reputation: 530
You can correct spelling errors by maintaining a vocabulary of valid words and replacing each unknown word with the closest valid one under a string metric such as the Levenshtein distance. There are also more advanced Python tools, like spaCy Hunspell. That being said, if you plan to use pre-trained word embeddings, I wouldn't worry too much about text normalisation, as the embeddings will likely contain most common spelling variants. You can check how many out-of-vocabulary words you have in your data to see whether it's worth investing time in extra cleaning beyond basic tokenisation and converting everything to lowercase.
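To make the vocabulary idea concrete, here is a minimal sketch in plain Python. The function names (`levenshtein`, `correct`, `oov_rate`) and the `max_distance=2` cut-off are my own assumptions for illustration, not something from a particular library, and the linear scan over the vocabulary will be slow for large word lists.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def correct(word: str, vocabulary: set, max_distance: int = 2) -> str:
    """Return the closest vocabulary word, or the word itself if none is close.

    max_distance is an assumed threshold; tune it on your own data.
    """
    word = word.lower()
    if word in vocabulary or not vocabulary:
        return word
    # Ties are broken alphabetically so the result is deterministic.
    distance, best = min((levenshtein(word, v), v) for v in vocabulary)
    return best if distance <= max_distance else word


def oov_rate(tokens: list, vocabulary: set) -> float:
    """Fraction of tokens that are not in the vocabulary (after lowercasing)."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token.lower() not in vocabulary)
    return missing / len(tokens)
```

If `oov_rate` over your tokenised corpus (measured against the vocabulary of your embeddings) is already low, the correction step is probably not worth the extra time.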
Upvotes: 2