I need to remove all invisible characters in python

Question

I have a long text that I need to be as clean as possible. I have collapsed multiple spaces into a single space. I have removed and . I stripped the resulting string.

I then found characters like \u2003 and \u2019. What are these? How do I make sure that I have removed all special characters from my text?

Besides the and the \u2003, should I check for more characters to remove? I am using Python 3.6.

Mike Peder · Accepted Answer

Try this:

import re
# string contains the \u2003 character
string = u'This is a   test string ’'
# this regex replaces each run of non-word characters with a single space
# (raw string, so \W is not treated as an invalid string escape)
re.sub(r'\W+', ' ', string).strip()

Result

'This is a test string'
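
To see what the two characters from the question actually are, and what \W+ does to them, something like the following may help. It is only a sketch: the sample string is made up for illustration and is not the asker's text. The unicodedata module is part of the standard library.

```python
import re
import unicodedata

# Made-up sample containing U+2003 and U+2019 (not the asker's real text).
sample = "It\u2019s a\u2003test, really!"

# Look up the official Unicode names of the two characters.
for ch in "\u2003\u2019":
    print(repr(ch), unicodedata.name(ch))

# \W+ matches runs of anything that is not a letter, digit or underscore,
# so visible punctuation disappears along with the invisible characters.
cleaned = re.sub(r"\W+", " ", sample).strip()
print(cleaned)
```

Note that the comma, the exclamation mark and the apostrophe are all replaced as well, which is why the second regex in this answer may be the better fit when punctuation matters.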

If you want to preserve ASCII special characters:

re.sub('[^!-~]+',' ',string).strip()

This regex reads: select [not characters 33-126] one or more times, where characters 33-126 are the visible range of ASCII.

In a regex character class, a leading ^ means "not" and the - indicates a range. Looking at an ASCII table, 32 is the space and all characters below it are either control characters or other forms of whitespace such as tab and newline. Character 33 is the ! mark and the last displayable character in ASCII is 126, or ~.
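
The bounds of the [^!-~] class can be checked directly in Python. This is a minimal sketch, and the sample string is again a made-up example:

```python
import re

# Code points that bound the visible ASCII range used in [^!-~].
low, high = ord("!"), ord("~")

# Every character in that range, built from the code points.
visible = "".join(chr(c) for c in range(low, high + 1))

# Made-up sample with an em space (U+2003) and a curly quote (U+2019).
sample = "Price:\u2003$5 (it\u2019s cheap)"

# Anything outside the visible ASCII range becomes a single space.
cleaned = re.sub(r"[^!-~]+", " ", sample).strip()
print(low, high, len(visible))
print(cleaned)
```

Keep in mind that this approach also replaces legitimate non-ASCII text, such as accented letters, because those fall outside the 33-126 range too.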
