Markus

Reputation: 127

Removing Words that contain non-ascii characters using Python

I am using the following functions to strip out non-ASCII characters:

def removeNonAscii(s): 
    return "".join(filter(lambda x: ord(x)<128, s))

def removeNonAscii1(s): 
    return "".join(i for i in s if ord(i)<128)

I would now like to remove the entire word if it contains any non-ascii characters. I thought of measuring the length pre and post function application but I am confident that there is a more efficient way. Any ideas?

Upvotes: 2

Views: 1234

Answers (3)

Zaur Amikishiyev

Reputation: 378

I came up with the following function. It removes all words that contain any printable ASCII character, but the range can probably be adjusted as desired.

import re

def removeWordsWithASCII(s):
    return " ".join(filter(lambda x: not re.search(r'[\x20-\x7E]', x), s.split(' ')))
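Since the question asks for the opposite filter (dropping words that contain non-ASCII characters), here is a rough sketch of the same regex idea with the character class inverted. The names `NON_ASCII` and `remove_words_with_non_ascii` are made up for illustration, and words are assumed to be separated by single spaces, as in the function above.

```python
import re

# Matches any single character outside the 7-bit ASCII range (code points 0-127).
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def remove_words_with_non_ascii(s):
    # Keep only the space-separated words in which no non-ASCII character is found.
    return " ".join(w for w in s.split(" ") if not NON_ASCII.search(w))

print(remove_words_with_non_ascii("naïve plain text"))
```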

Upvotes: 0

DYZ

Reputation: 57033

The cleanest (but not necessarily the most efficient) way is to encode the word to bytes and attempt to decode it as ASCII. If the attempt fails, the word has non-ASCII characters:

def is_ascii(w):
  try:
    # encode() produces UTF-8 bytes; decoding them as ASCII fails on any byte above 127
    w.encode().decode("us-ascii")
    return True
  except UnicodeDecodeError:
    return False
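On Python 3.7 and later, the built-in `str.isascii()` should give the same answer without the try/except. A minimal sketch of applying it to the original problem, assuming words are whitespace-delimited; `remove_non_ascii_words` is a hypothetical helper name, not something from the question:

```python
def remove_non_ascii_words(sentence):
    # str.isascii() (Python 3.7+) is True when every character is in the range 0-127.
    # split() with no argument splits on any run of whitespace.
    return " ".join(w for w in sentence.split() if w.isascii())

print(remove_non_ascii_words("héllo world café ok"))  # world ok
```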

Upvotes: 1

Scott Colby

Reputation: 1430

If you define the word based on spaces, something like this might work:

def containsNonAscii(s):
    return any(ord(i)>127 for i in s)

words = sentence.split()
cleaned_words = [word for word in words if not containsNonAscii(word)]
cleaned_sentence = ' '.join(cleaned_words)

Note that this will collapse repeated whitespace into just one space.
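If the original spacing matters, one possible workaround is to split with a capturing group so the whitespace runs are kept as tokens and can be joined back unchanged. This is only a sketch: `remove_non_ascii_words_keep_spacing` is a hypothetical name, and non-ASCII whitespace (such as a non-breaking space) would be dropped along with the words.

```python
import re

def remove_non_ascii_words_keep_spacing(sentence):
    # The capturing group makes re.split return the whitespace runs as tokens too.
    tokens = re.split(r"(\s+)", sentence)
    # Keep a token only if every character is ASCII; ASCII whitespace tokens pass unchanged.
    return "".join(t for t in tokens if all(ord(c) < 128 for c in t))

print(repr(remove_non_ascii_words_keep_spacing("a  café  b")))
```

The whitespace on both sides of a removed word is left in place, so gaps get wider rather than collapsing.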

Upvotes: 3
