Reputation: 479
Python experts:
I have a sentence like:
"this time air\u00e6\u00e3o was filled\u00e3o"
I wish to remove the non-ASCII Unicode characters.
I can use the following function:
def removeNonAscii(s):
    return "".join(filter(lambda x: ord(x)<128, s))
sentence = "this time air\u00e6\u00e3o was filled\u00e3o"
sentence = removeNonAscii(sentence)
print(sentence)
This prints "this time airo was filledo", so it works great at removing the "\u00.." characters.
But when I write the sentence to a file, then read it back and process it in a loop:
def removeNonAscii(s):
    return "".join(filter(lambda x: ord(x)<128, s))
hand = open('test.txt')
for sentence in hand:
    sentence = removeNonAscii(sentence)
    print(sentence)
it prints "this time air\u00e6\u00e3o was filled\u00a3o".
It doesn't work at all. What is happening here? If the function works, the output should not
look like that.
Upvotes: 3
Views: 3085
Reputation: 5515
I have a feeling that instead of containing the actual non-ASCII
characters, your file contains the literal text of the Unicode escape sequences. That is, instead of whatever character you think is there, the file actually holds the ASCII characters of the code \u00-- (a backslash, the letter u, and four more characters).
So when you run your code, it reads each of those characters, sees that they are all plain ASCII, and the filter leaves them in place.
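One way to see the difference is the small sketch below. It assumes Python 3, and it uses a raw string to stand in for a line read from a file that literally contains the escape text; the names `from_source`, `from_file`, and `remove_non_ascii` are made up for illustration.

```python
def remove_non_ascii(s):
    # Keep only characters whose code point is in the ASCII range.
    return "".join(ch for ch in s if ord(ch) < 128)

# In Python source, "\u00e6" is an escape that produces ONE character.
from_source = "air\u00e6o"

# A raw string mimics text read from a file that literally contains
# a backslash, the letter u, and four hex digits: six ASCII characters.
from_file = r"air\u00e6o"

print(len(from_source), repr(remove_non_ascii(from_source)))
print(len(from_file), repr(remove_non_ascii(from_file)))
```

The first string is 5 characters long and loses its one non-ASCII character, while the second is 10 characters long and passes through the filter unchanged, because every character in it is already ASCII.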
If this is the case, use this:
import re
def removeNonAscii(s):
    return re.sub(r'\\u\w{4}','',s)
It will remove all instances of '\u----'.
Example:
>>> with open(r'C:\Users\...\file.txt','r') as f:
...     for line in f:
...         print(re.sub(r'\\u\w{4}','',line))
...
this time airo was filledo
where file.txt has:
this time air\u00e6\u00e3o was filled\u00a3o
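As a possible variation (a sketch, not part of the answer above, and assuming Python 3), the pattern could be restricted to hex digits so that ordinary text such as `\users` is left alone, since `\w{4}` matches any four word characters. A second option is to decode each escape into its real character first and then apply the original ASCII filter, which keeps escapes that happen to stand for ASCII characters. The names `ESCAPE`, `strip_escapes`, and `decode_then_filter` are hypothetical.

```python
import re

# Stricter pattern: a backslash, 'u', then exactly four hex digits.
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def strip_escapes(s):
    # Delete every literal \uXXXX sequence outright.
    return ESCAPE.sub("", s)

def decode_then_filter(s):
    # Turn each literal \uXXXX into the character it names,
    # then drop whatever is outside the ASCII range.
    decoded = ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)
    return "".join(ch for ch in decoded if ord(ch) < 128)

line = r"this time air\u00e6\u00e3o was filled\u00a3o"
print(strip_escapes(line))
print(decode_then_filter(line))

# The two differ when an escape encodes an ASCII character (\u0041 is "A").
print(repr(strip_escapes(r"caf\u00e9 \u0041")))
print(repr(decode_then_filter(r"caf\u00e9 \u0041")))
```

For the sample line both functions give the same result. They differ only when an escape names an ASCII character: stripping removes it, decoding keeps it.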
Upvotes: 2