Reputation: 1511
I have a long text that I need to be as clean as possible. I have collapsed multiple spaces into a single space, removed \n and \t, and stripped the resulting string.
I then found characters like \u2003 and \u2019. What are these? How do I make sure that I have removed all special characters from my text?
Besides \n, \t and \u2003, should I check for more characters to remove? I am using Python 3.6.
Upvotes: 0
Views: 9269
Reputation: 41
Thank you Mike Peder, this solution worked for me. However, I had to apply it to both sides of the comparison:
if re.sub(r'[^!-~]+', ' ', date).strip() == re.sub(r'[^!-~]+', ' ', calendarData[i]).strip():
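To avoid repeating the substitution on each side, the same idea can be wrapped in a small helper. This is only a sketch: the name normalize, the compiled pattern name, and the sample values are hypothetical and not part of the original code.

```python
import re

# Anything outside the visible ASCII range '!' (33) to '~' (126),
# including ordinary spaces, tabs, newlines and non-ASCII characters.
_NON_VISIBLE_ASCII = re.compile(r"[^!-~]+")

def normalize(text):
    """Collapse every run of non-visible-ASCII characters into one space."""
    return _NON_VISIBLE_ASCII.sub(" ", text).strip()

# Hypothetical values: the first contains an em space (U+2003).
date = "12\u2003March 2019"
calendar_entry = "12 March 2019"

if normalize(date) == normalize(calendar_entry):
    print("dates match")
```

Normalizing both operands the same way means the comparison no longer depends on which kind of whitespace each source happened to use.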
Upvotes: 0
Reputation: 738
Try this:
import re
# string contains the \u2003 (em space) and \u2019 (curly quote) characters
string = 'This is a test string\u2003\u2019'
# this regex replaces every run of non-word characters with a single space
re.sub(r'\W+', ' ', string).strip()
Result
'This is a test string'
If you want to preserve ASCII special characters:
re.sub(r'[^!-~]+', ' ', string).strip()
This regex reads: match [not characters 33-126] one or more times, where characters 33-126 are the visible range of ASCII. Inside a regex character class, a leading ^ means "not" and - indicates a range. Looking at an ASCII table, 32 is the space, and all characters below it are either control characters or another form of whitespace, such as tab and newline. Character 33 is the ! mark, and the last displayable character in ASCII is 126, or ~.
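To see what the stray characters are and how the two patterns differ, here is a minimal sketch. The sample string is made up for illustration; unicodedata is part of the standard library.

```python
import re
import unicodedata

# Hypothetical sample: a curly apostrophe (U+2019) and an em space (U+2003).
sample = "It\u2019s a test,\u2003really!"

# Print the code point and official Unicode name of each non-ASCII character.
for ch in sample:
    if ord(ch) > 126:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))

# \W+ also removes ASCII punctuation such as ',' and '!'.
words_only = re.sub(r"\W+", " ", sample).strip()

# [^!-~]+ keeps visible ASCII punctuation and drops everything else.
visible_ascii = re.sub(r"[^!-~]+", " ", sample).strip()

print(words_only)     # It s a test really
print(visible_ascii)  # It s a test, really!
```

In both cases the curly apostrophe becomes a space, so "It’s" is split into "It s". If that matters for your text, one option is to replace \u2019 with a plain ' before applying the regex.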
Upvotes: 2