Reputation: 127
I am using the following functions to strip out non-ASCII characters:
    def removeNonAscii(s):
        return "".join(filter(lambda x: ord(x)<128, s))

    def removeNonAscii1(s):
        return "".join(i for i in s if ord(i)<128)
I would now like to remove the entire word if it contains any non-ASCII characters. I thought of comparing the length before and after applying the function, but I am sure there is a more efficient way. Any ideas?
Upvotes: 2
Views: 1234
Reputation: 378
I came up with the following function. It removes all words that contain any printable ASCII character (0x20-0x7E), but the range can probably be extended as desired.
    import re
    def removeWordsWithASCII(s):
        # Keep only the words with no character in the range 0x20-0x7E.
        return " ".join(filter(lambda x: not re.search(r'[\x20-\x7E]', x), s.split(' ')))
Upvotes: 0
Reputation: 57033
The cleanest (but not necessarily the most efficient) way is to convert a word to bytes and attempt to decode it as ASCII. If the attempt fails, the word contains non-ASCII characters:
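    def is_ascii(w):
        try:
            # Encode to bytes (UTF-8 by default), then try to read them back as ASCII.
            w.encode().decode("us-ascii")
            return True
        except UnicodeError:
            # UnicodeError covers both UnicodeEncodeError and UnicodeDecodeError.
            return False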
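Applied to the original question, the check can be used to filter a sentence word by word. The following is only a minimal sketch: the helper name `drop_non_ascii_words` and the sample sentence are made up for illustration, and the check encodes straight to ASCII (a slight variation on the round trip above) so that only `UnicodeEncodeError` needs to be caught.

```python
def is_ascii(w):
    """Return True if every character in w is ASCII."""
    try:
        # Encoding directly to ASCII fails on the first non-ASCII character.
        w.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def drop_non_ascii_words(sentence):
    # Hypothetical helper: keep only the whitespace-separated words
    # that consist entirely of ASCII characters.
    return " ".join(word for word in sentence.split() if is_ascii(word))


print(drop_non_ascii_words("naïve cafe über test"))  # prints: cafe test
```

On Python 3.7 or later, `str.isascii()` should give the same answer as `is_ascii` without the try/except.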
Upvotes: 1
Reputation: 1430
If you define a word as a run of characters separated by whitespace, something like this might work:
    def containsNonAscii(s):
        return any(ord(i)>127 for i in s)

    words = sentence.split()
    cleaned_words = [word for word in words if not containsNonAscii(word)]
    cleaned_sentence = ' '.join(cleaned_words)
Note that this will collapse every run of whitespace (including tabs and newlines) into a single space.
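If the original spacing matters, one possible workaround is to split with a capturing group so that the whitespace runs are kept as tokens and can be joined back unchanged. This is a sketch under that assumption; the function name `remove_non_ascii_words_keep_spacing` and the sample string are illustrative only.

```python
import re


def contains_non_ascii(s):
    return any(ord(ch) > 127 for ch in s)


def remove_non_ascii_words_keep_spacing(sentence):
    # re.split with a capturing group returns the whitespace runs as
    # separate tokens, so they survive the filtering step untouched.
    tokens = re.split(r"(\s+)", sentence)
    return "".join(t for t in tokens if not contains_non_ascii(t))


# The whitespace on both sides of the removed word is kept as-is.
print(repr(remove_non_ascii_words_keep_spacing("foo  bär\tbaz")))  # 'foo  \tbaz'
```

The trade-off is that the gaps on either side of a removed word end up adjacent, so whether this or the simpler `split()`/`join()` version is preferable depends on how the cleaned text will be used.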
Upvotes: 3