Reputation: 259
I was trying to read a text file with the following Python code:
with open(file, 'r') as myfile:
    data = myfile.read()
I got some weird characters that start with \x.... What do they stand for, and how do I get rid of them when reading a text file?
e.g.
...... \xc2\xa0 \xc2\xa0 chapter 1 tuesday 1984 \xe2\x80\x9chey , jake , your mom sent me to pick you up \xe2\x80\x9d jacob robbins knew better than to accept a ride from a stranger , but when his mom\xe2\x80\x99s friend ronny was waiting for him in front of school he reluctantly got in the car \xe2\x80\x9cmy name is jacob........
Upvotes: 4
Views: 11874
Reputation: 1
The code below clears the issue:
path.decode('utf-8','ignore').strip()
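In Python 3, `decode` exists only on `bytes`, so the same idea would apply to raw bytes rather than to a `str`. A minimal sketch, assuming the sample bytes below (copied from the question, with one invalid byte appended for illustration); `raw` and `text` are illustrative names:

```python
# Sample bytes taken from the question's output; the trailing \xff is an
# invalid UTF-8 byte added here purely for illustration.
raw = b"\xc2\xa0 \xc2\xa0 chapter 1 \xe2\x80\x9chey , jake\xe2\x80\x9d \xff"

# errors='ignore' silently drops bytes that are not valid UTF-8 (the \xff).
# Valid sequences such as the curly quotes are decoded, not removed.
text = raw.decode("utf-8", "ignore").strip()

print(repr(text))
```

Note that `'ignore'` only discards undecodable bytes. The non-breaking spaces disappear here because `strip()` treats them as whitespace at the ends of the string, not because of the decode step.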
Upvotes: 0
Reputation: 1
import sys

def main():
    args = parse_args()
    if args.file:
        # To clean \xc2\xa0 \xc2\xa0… in text data
        file_to_read = args.file.decode('utf-8', 'ignore').strip()
        f = open(file_to_read, "r+")
        text_from_file = f.read()
    else:
        text_from_file = sys.argv[1]
Upvotes: 0
Reputation: 4155
Those are string escapes. They represent a character by its hexadecimal value. For example, \x24 is 0x24, which is the dollar sign.
>>> '\x24'
'$'
>>> chr(0x24)
'$'
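One such escape (from the ones you provided) is \xc2, which taken on its own is Â, a capital A with a circumflex. In your text, though, it is the first byte of the two-byte UTF-8 sequence \xc2\xa0, which encodes a non-breaking space.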
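The escapes in the question's sample appear to be bytes of multi-byte UTF-8 sequences rather than standalone characters. A small sketch that decodes each sequence and looks up its Unicode name (`samples` and `names` are illustrative names, and the byte sequences are copied from the question):

```python
import unicodedata

# Byte sequences copied from the question's sample text.
samples = [b"\xc2\xa0", b"\xe2\x80\x9c", b"\xe2\x80\x9d", b"\xe2\x80\x99"]

names = {}
for seq in samples:
    char = seq.decode("utf-8")  # each sequence decodes to exactly one character
    names[seq] = unicodedata.name(char)
    print(seq, "->", repr(char), names[seq])
```

Read this way, the sample contains non-breaking spaces, curly double quotes, and a typographic apostrophe.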
Upvotes: 2
Reputation: 799230
That's UTF-8 encoded text. Open the file as UTF-8:
with open(file, 'r', encoding='utf-8') as myfile:
    ...
In Python 2.x:
import codecs
with codecs.open(file, 'r', encoding='utf-8') as myfile:
    ...
See also: Unicode In Python, Completely Demystified
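To make this concrete, here is a hedged Python 3 sketch: it writes a few bytes from the question to a temporary file, reads them back with `encoding='utf-8'`, and then optionally maps the typographic characters to plain ASCII. The sample bytes, the `to_ascii` mapping, and the variable names are assumptions for illustration, not part of the original answer.

```python
import os
import tempfile

# Write a small UTF-8 sample (bytes copied from the question) to a temporary file.
sample = b"\xc2\xa0 chapter 1 \xe2\x80\x9chey , jake\xe2\x80\x9d his mom\xe2\x80\x99s friend"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Decoding happens while reading, so `data` holds real characters
    # (non-breaking space, curly quotes) instead of raw bytes.
    with open(path, "r", encoding="utf-8") as myfile:
        data = myfile.read()
finally:
    os.remove(path)

# Optional: replace typographic characters with ASCII look-alikes
# (hypothetical mapping; extend it for whatever your text contains).
to_ascii = str.maketrans({"\u00a0": " ", "\u201c": '"', "\u201d": '"', "\u2019": "'"})
cleaned = data.translate(to_ascii).strip()
print(cleaned)
```

Whether to keep the curly quotes or replace them depends on what you do with the text afterwards; decoding correctly is the essential step, and the replacement is only a convenience.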
Upvotes: 7