Reputation: 33273
I am reading some data from a file, but I am seeing some weird characters, like:
'tamb\xc3\xa9m', 'f\xc3\xbcr','cari\xc3\xb1o'
My file read code is fairly standard:
with open(filename) as f:
    for line in f:
        print line
Upvotes: 3
Views: 11075
Reputation: 1123450
You have UTF-8 encoded data. You could decode the data:
with open(filename) as f:
    for line in f:
        print line.decode('utf8')
or use io.open() to have Python decode the contents for you as you read:
import io
with io.open(filename, encoding='utf8') as f:
    for line in f:
        print line
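As a self-contained sketch of the io.open() approach, the snippet below writes the three words from the question to a temporary file as UTF-8 bytes and reads them back as decoded text. The temporary file and the helper name read_lines are assumptions made for illustration only; the code avoids the print statement so that it should run on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def read_lines(filename, encoding='utf8'):
    """Return the lines of a text file as decoded (Unicode) strings."""
    # io.open() decodes bytes to text while reading, on Python 2 and 3.
    with io.open(filename, encoding=encoding) as f:
        return [line.rstrip(u'\n') for line in f]


# Hypothetical sample file holding the raw UTF-8 bytes from the question.
raw = b'tamb\xc3\xa9m\nf\xc3\xbcr\ncari\xc3\xb1o\n'
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(raw)
    words = read_lines(path)  # decoded text, one entry per line
finally:
    os.remove(path)
```

Each entry in words is a text string, so the two-byte sequences such as \xc3\xa9 have become single characters.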
Your data, decoded:
>>> print 'tamb\xc3\xa9m'.decode('utf8')
também
>>> print 'f\xc3\xbcr'.decode('utf8')
für
>>> print 'cari\xc3\xb1o'.decode('utf8')
cariño
You appear to have printed string representations (the output of the repr() function), which produce string literal syntax suitable for pasting back into your Python interpreter. \xhh hex codes are used for characters outside of the printable ASCII range. Python containers such as list or dict also use repr() to show their contents when printed.
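The difference between the raw bytes, their repr() output, and the decoded text can be sketched as follows. The variable names are illustrative assumptions, and a bytes literal is used so the snippet should behave the same way on Python 2.7 and Python 3 (on Python 3 the repr() of bytes additionally carries a b prefix).

```python
raw = b'cari\xc3\xb1o'       # UTF-8 encoded bytes, as read from the file
text = raw.decode('utf8')    # decoded text; the two bytes become one character

escaped = repr(raw)          # literal syntax; non-ASCII bytes shown as \xhh
in_list = repr([raw])        # containers apply repr() to each element
```

Here escaped and in_list both contain the literal backslash sequence \xc3\xb1, which is what shows up when a list of undecoded strings is printed, while text holds the single character ñ.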
You may want to read up on Unicode, and how it interacts with Python. See:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
Upvotes: 11