pg2455

Reputation: 5148

How to remove special characters from strings in Python?

I have millions of strings scraped from the web, like:

s = 'WHAT\xe2\x80\x99S UP DOC?'
type(s) == str # returns True

Special characters like the ones in the string above are inevitable when scraping from the web. How can I remove all such special characters and keep only clean text? Based on my very limited experience with Unicode characters, I am thinking of a regular expression like this:

\\x.*[0-9]

Upvotes: 1

Views: 2284

Answers (2)

pg2455

Reputation: 5148

This worked for me, as suggested by Padriac in the comments:

s.decode('ascii', errors='ignore')
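The call above relies on Python 2, where s is a byte string and has a decode method. A rough Python 3 equivalent might look like the sketch below. The variable names are only illustrative, and it assumes the scraped value is either a bytes object or a str that already contains the stray characters.

```python
# Case 1: the scraped value is bytes (assumed here). Decoding as ASCII
# with errors='ignore' silently drops every byte outside the ASCII range.
raw = b'WHAT\xe2\x80\x99S UP DOC?'
from_bytes = raw.decode('ascii', errors='ignore')

# Case 2: the value is already a str. Python 3 str has no decode(),
# so encode to ASCII (dropping what does not fit) and decode back.
text = 'WHAT\xe2\x80\x99S UP DOC?'
from_str = text.encode('ascii', errors='ignore').decode('ascii')

print(from_bytes)
print(from_str)
```

Both variants discard the non-ASCII characters entirely, so the apostrophe in the original text is lost as well.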

Upvotes: 2

Cory Kramer

Reputation: 117856

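The special characters are not actually multiple characters long; that is just how they are represented, so your regex isn't going to work. Each escape such as \xe2 stands for a single character, not a literal backslash followed by letters and digits. If you print the string, you will see the characters themselves rather than their escape sequences: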

>>> s = 'WHAT\xe2\x80\x99S UP DOC?'
>>> print(s)
WHATâS UP DOC?
>>> repr(s)
"'WHATâ\\x80\\x99S UP DOC?'"
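To make the point about the regex concrete, here is a small Python 3 sketch. It checks that the string contains no literal backslash, shows that the pattern from the question finds nothing, and then tries a character class that matches anything outside the ASCII range. The character-class pattern is one possible alternative, not something from the original question.

```python
import re

s = 'WHAT\xe2\x80\x99S UP DOC?'

# Each \xNN escape is one character, so the three escapes add up to
# three characters and the string holds no literal backslash at all.
has_backslash = '\\' in s
escape_length = len('\xe2\x80\x99')

# The pattern from the question looks for a literal backslash followed
# by 'x', which never occurs in the actual string.
no_match = re.search(r'\\x.*[0-9]', s)

# A negated character class over the ASCII range does match the
# special characters and can be used to strip them.
cleaned = re.sub(r'[^\x00-\x7f]', '', s)

print(has_backslash, escape_length, no_match)
print(cleaned)
```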

If you want to keep only the ASCII characters, you can check whether each character is in string.printable:

>>> import string
>>> ''.join(i for i in s if i in string.printable)
'WHATS UP DOC?'
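Filtering on string.printable also throws away the apostrophe that the three special characters stood for. If that matters, one option is to repair the text first and map typographic punctuation to plain ASCII before filtering. The sketch below is only an illustration for Python 3: PUNCT_MAP and to_clean_ascii are hypothetical names, and it assumes the strings are UTF-8 text that was mistakenly read as Latin-1.

```python
import string

# Hypothetical mapping of curly quotes to ASCII quotes; extend it with
# whatever punctuation actually shows up in the scraped data.
PUNCT_MAP = str.maketrans({
    '\u2018': "'", '\u2019': "'",
    '\u201c': '"', '\u201d': '"',
})

def to_clean_ascii(s):
    """Repair mis-decoded UTF-8, map curly quotes, keep printable ASCII."""
    try:
        # Assumption: the text is UTF-8 that was decoded as Latin-1.
        s = s.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass  # not this kind of damage, so leave the string unchanged
    s = s.translate(PUNCT_MAP)
    return ''.join(ch for ch in s if ch in string.printable)

result = to_clean_ascii('WHAT\xe2\x80\x99S UP DOC?')
print(result)
```

With the example string this keeps a plain apostrophe instead of dropping it. Characters that are not in the mapping and not printable ASCII are still removed.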

Upvotes: 3
