How to remove all the spaces between letters?

I have text containing words like this: a n a l i z e, c l a s s, etc. There are normally written words as well. I need to remove the spaces between the letters of the spaced-out words.

import re

# Attempted pattern: a single letter with whitespace on both sides
reg_let = re.compile(r'\s[А-Яа-яёЁa-zA-Z](\s)', re.DOTALL)
text = 'T h i s is exactly w h a t I needed'
text = re.sub(reg_let, '', text)
text

OUTPUT: 'Tiis exactlyhtneeded' (while I need - 'This is exactly what I needed')

Upvotes: 0

Views: 123

Answers (2)

Nikaido

Reputation: 4629

There is no easy solution to this problem.

The only solution I can think of is to use a dictionary to check whether a candidate word is valid (that is, present in an English dictionary).

But even then you'll get a lot of false positives. For example, given the text:

a n a n a s

the words:

  • a
  • an
  • as

are all in the English dictionary. How should the text be split? To me, as a human who can read the text, it is clear that the word here is ananas. But one could also split the text like this:

an an as

Each piece is a valid word, but the result doesn't make sense in English. Correctness depends on context, and as a human I can understand the context. One could split and rejoin the string in different ways to check whether the result makes sense, but unfortunately there is no library or simple procedure that can understand context.

Machine learning could be a way, but there is no perfect solution.
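To make the ambiguity concrete, here is a minimal sketch that lists every way the letters can be split into dictionary words. The `vocabulary` set is a hypothetical toy dictionary chosen for this example, not a real word list, and `segmentations` is a helper name made up for illustration.

```python
def segmentations(letters, vocabulary):
    """Yield every way to split `letters` into words found in `vocabulary`."""
    if not letters:
        yield []  # nothing left to split: one valid (empty) segmentation
        return
    for end in range(1, len(letters) + 1):
        head = letters[:end]
        if head in vocabulary:
            # Keep this word and try to split the remainder the same way
            for tail in segmentations(letters[end:], vocabulary):
                yield [head] + tail


# Hypothetical toy dictionary (assumption, for illustration only)
vocabulary = {"a", "an", "as", "ananas"}

letters = "a n a n a s".replace(" ", "")
candidates = [" ".join(words) for words in segmentations(letters, vocabulary)]
print(candidates)
```

With this toy dictionary both "an an as" and "ananas" come back as valid, and a dictionary lookup alone gives no reason to prefer one over the other.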

Upvotes: 1

Elad Cohen

Reputation: 471

As far as I know, there is no easy way to do this, because the biggest problem is distinguishing the meaningful words. In other words, you need some kind of semantic engine to tell you which word is meaningful in the sentence.

The only thing I can think of is a word embedding model. Without something like that you can remove as many spaces as you want, but you can't distinguish the words, meaning you'll never know which spaces not to remove.

I would love for someone to correct me if there's a simpler way I'm not aware of.
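As a rough illustration of that limit, here is a purely mechanical sketch that joins every run of single letters separated by single spaces. The function name `collapse_spaced_letters` is made up for this example, and the letter class is the one from the question. It is a heuristic, not a real fix.

```python
import re

LETTER = "[А-Яа-яёЁa-zA-Z]"

# A run of "single letter + one space" pairs ending in a final single letter.
# The \b anchors make sure each letter stands alone rather than inside a word.
SPACED_RUN = re.compile(rf"\b(?:{LETTER} )+{LETTER}\b")


def collapse_spaced_letters(text):
    """Remove the spaces inside every run of space-separated single letters."""
    return SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


result = collapse_spaced_letters('T h i s is exactly w h a t I needed')
print(result)  # This is exactly whatI needed
```

"T h i s" is repaired, but the standalone word "I" gets glued onto "what", because nothing in the pattern knows that "I" is a word of its own. Deciding that requires knowing which words are meaningful.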

Upvotes: 1
