Removing non-ASCII characters from file text

Question

Python experts:

I have a sentence like: "this time air\u00e6\u00e3o was filled\u00e3o". I wish to remove the non-ASCII Unicode characters. I can use the following code and function:

def removeNonAscii(s): 
    return "".join(filter(lambda x: ord(x)<128, s))          

sentence = "this time air\u00e6\u00e3o was filled\u00e3o"   
sentence = removeNonAscii(sentence)
print(sentence)

Then it shows "this time airo was filledo", which works great to remove the "\u00.." characters. But when I write the sentence to a file, then read it back and process it in a loop:

def removeNonAscii(s):
    return "".join(filter(lambda x: ord(x)<128, s))

hand = open('test.txt')
for sentence in hand:
    sentence = removeNonAscii(sentence)
    print(sentence)

it shows "this time air\u00e6\u00e3o was filled\u00a3o", so it doesn't work at all. What is happening here? If the function works, it should not behave that way.

R Nar · Accepted Answer

I have a feeling that, instead of containing the actual non-ASCII characters, your file contains the Unicode escape sequences typed out as plain text. That is, instead of the character you think is there, the file holds the six ASCII characters \u00--. When your code runs, it reads each of those characters, sees that every one is plain ASCII, and so the filter leaves them all in place.
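To make the distinction concrete, here is a minimal sketch (the names real_char, literal_text and remove_non_ascii are illustrative, not from the question) comparing a real non-ASCII character with the same escape typed out as text. It assumes Python 3.

```python
# In Python source, "\u00e6" is an escape that produces ONE character (æ).
real_char = "\u00e6"

# A raw string keeps the backslash, giving the SIX characters \ u 0 0 e 6.
# This is what a text file holds if the escape was written out literally.
literal_text = r"\u00e6"

def remove_non_ascii(s):
    # Keep only characters whose code point is in the ASCII range.
    return "".join(ch for ch in s if ord(ch) < 128)

print(len(real_char), len(literal_text))
print(repr(remove_non_ascii("air" + real_char + "o")))     # the real character is dropped
print(repr(remove_non_ascii("air" + literal_text + "o")))  # the typed-out escape survives
```

The filter only ever sees the backslash, the letter u, and digits in the second case, all of which are ASCII, so nothing is removed.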

If this is the case, use this:

import re
def removeNonAscii(s):
    # Match a literal backslash, the letter 'u', and exactly four hex digits
    return re.sub(r'\\u[0-9a-fA-F]{4}','',s)

and it will take away all instances of '\u----'

example:

>>> with open(r'C:\Users\...\file.txt','r') as f:
    for line in f:
        print(re.sub(r'\\u[0-9a-fA-F]{4}','',line))
this time airo was filledo

where file.txt has:

this time air\u00e6\u00e3o was filled\u00a3o
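If the file might contain both typed-out escapes and real non-ASCII characters, the two approaches can be combined. The following is only a sketch under that assumption: strip_non_ascii and ESCAPE_RE are hypothetical names, and io.StringIO stands in for the opened file so the example runs without test.txt.

```python
import io
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
ESCAPE_RE = re.compile(r'\\u[0-9a-fA-F]{4}')

def strip_non_ascii(s):
    # First drop escapes that were typed out as text,
    # then drop any real characters outside the ASCII range.
    s = ESCAPE_RE.sub('', s)
    return "".join(ch for ch in s if ord(ch) < 128)

# io.StringIO stands in for open('test.txt') so the sketch is self-contained.
fake_file = io.StringIO("this time air\\u00e6\\u00e3o was filled\\u00a3o\n")
cleaned = [strip_non_ascii(line.rstrip("\n")) for line in fake_file]
print(cleaned)
```

Writing the backslash as \\ inside the raw pattern makes the regex match the backslash character itself, rather than leaving \u to be interpreted as a regex escape.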
