Mauro Gentile

Reputation: 1511

I need to remove all invisible characters in Python

I have a long text that I need to be as clean as possible. I have collapsed multiple spaces into a single space, removed \n and \t, and stripped the resulting string.

I then found characters like \u2003 and \u2019. What are these? How do I make sure that I have removed all special characters from my text?

Besides \n, \t and \u2003, should I check for more characters to remove? I am using Python 3.6.

Upvotes: 0

Views: 9269

Answers (2)

PythonNewbe

Reputation: 41

Thank you Mike Peder, this solution worked for me. However, I had to apply it to both sides of the comparison:

if re.sub('[^!-~]+', ' ', date).strip() == re.sub('[^!-~]+', ' ', calendarData[i]).strip():
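To avoid repeating the regex on each side, the cleanup could be wrapped in a small helper. This is only a sketch: `visible_ascii` is a name I made up, and the sample values below are hypothetical stand-ins for `date` and `calendarData[i]`.

```python
import re

def visible_ascii(text):
    """Collapse every run of characters outside ASCII 33-126 into one space."""
    return re.sub(r"[^!-~]+", " ", text).strip()

# Hypothetical sample values standing in for `date` and `calendarData[i]`.
date = "12\u2003March 2019"        # contains an em space (U+2003)
calendar_entry = "12 March\t2019 "  # contains a tab and a trailing space

# Both sides are normalized the same way before comparing.
print(visible_ascii(date) == visible_ascii(calendar_entry))
```

Normalizing both values through the same function keeps the comparison symmetric, which is the point of doing it on both sides.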

Upvotes: 0

Mike Peder

Reputation: 738

Try this:

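import re
# the string contains \u2003 (em space) and \u2019 (right single quotation mark)
string = 'This is a \u2003 test string \u2019'
# \W+ matches each run of non-word characters (anything other than letters,
# digits and underscore) and replaces it with a single space
re.sub(r'\W+', ' ', string).strip()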

Result

'This is a test string'

If you want to preserve ASCII special characters such as punctuation:

re.sub('[^!-~]+',' ',string).strip()

This regex reads: match one or more characters that are not in the range 33-126, where 33-126 is the visible range of ASCII. Because the space (32) falls outside that range, runs of spaces are collapsed into one as well.

Inside a regex character class, a leading ^ means "not" and - indicates a range. Looking at an ASCII table, 32 is the space, and every character below it is either a control character or another form of whitespace such as tab and newline. Character 33 is ! and the last displayable ASCII character is 126, which is ~.
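As for what \u2003 and \u2019 actually are, the standard `unicodedata` module can name them. Here is a minimal sketch that lists every character above 126 in a sample string (the sample text is my own assumption, modeled on the string above) and then applies the visible-ASCII regex:

```python
import re
import unicodedata

# Sample text (an assumption for illustration) with an em space and a curly quote.
text = "This is a\u2003test string \u2019"

# Print the code point and Unicode name of every character above ASCII 126.
for ch in text:
    if ord(ch) > 126:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))

# Keep only visible ASCII (33-126); every other run becomes a single space.
cleaned = re.sub(r"[^!-~]+", " ", text).strip()
print(cleaned)
```

Here \u2003 is reported as EM SPACE and \u2019 as RIGHT SINGLE QUOTATION MARK. Note that this approach also drops accented letters and any other non-ASCII text, so it only fits if the cleaned text is meant to be plain ASCII.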

Upvotes: 2
