Reputation: 11030
I have a string with a bunch of non-ASCII characters, and I would like to remove them. I used the following function in Python 3:
def removeNonAscii(s):
    return "".join(filter(lambda x: ord(x)<128, s))
str1 = "Hi there!\xc2\xa0My\xc2\xa0name\xc2\xa0is\xc2\xa0Blue "
new = removeNonAscii(str1)
The new string becomes:
Hi there!MynameisBlue
Is it possible to get spaces between the words instead, so that the result is:
Hi there! My name is Blue
Upvotes: 1
Views: 373
Reputation: 133504
Regex wins here, but FWIW here is an itertools.groupby solution:
from itertools import groupby
text = "Hi there!\xc2\xa0My\xc2\xa0name\xc2\xa0is\xc2\xa0Blue "
def valid(c):
    return ord(c) < 128
def removeNonAscii(s):
    return ''.join(''.join(g) if k else ' ' for k, g in groupby(s, valid))
>>> removeNonAscii(text)
'Hi there! My name is Blue '
Upvotes: 0
Reputation: 56809
The code below is equivalent to your current code, except that for a contiguous sequence of characters outside the range of US-ASCII, it will replace the whole sequence with a single space (ASCII 32).
import re
re.sub(r'[^\x00-\x7f]+', " ", inputString)
Note that ASCII control characters (0x00-0x1F and 0x7F) are kept by the code above, just as they are by the code in the question.
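If the control characters should go as well, one possible variation is sketched below. The function name replace_non_ascii and the keep_control flag are made up for illustration and are not part of the original answer. With keep_control=False the pattern keeps only printable ASCII (0x20-0x7E), which means tabs and newlines are also replaced by a space; that may or may not be what you want.

```python
import re

# Hypothetical helper, not from the original answer.
# Runs of characters outside US-ASCII (control characters are kept).
_NON_ASCII_RUN = re.compile(r'[^\x00-\x7f]+')
# Runs of characters outside printable ASCII (control characters are dropped too).
_NON_PRINTABLE_RUN = re.compile(r'[^\x20-\x7e]+')

def replace_non_ascii(s, keep_control=True):
    """Replace each contiguous run of unwanted characters with one space."""
    pattern = _NON_ASCII_RUN if keep_control else _NON_PRINTABLE_RUN
    return pattern.sub(' ', s)

text = "Hi there!\xc2\xa0My\xc2\xa0name\xc2\xa0is\xc2\xa0Blue "
print(repr(replace_non_ascii(text)))
# 'Hi there! My name is Blue '
print(repr(replace_non_ascii("tab\there\x00end", keep_control=False)))
# 'tab here end'
```

Compiling the patterns once at module level only matters if the function is called many times; for a one-off, the plain re.sub call above is enough.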
Upvotes: 3