Why doesn't print() in Python 3.2 seem to be defaulting to UTF-8?

Question

I'm writing scripts to clean up unicode text files (stored as UTF-8), and I chose to use Python 3.x (3.2) rather than the more popular 2.x because 3.x is supposed to default to UTF-8. Maybe I'm doing something wrong, but it seems that the print statement, at least, still is not defaulting to UTF-8. If I try to print a string (msg below is a string) that contains special characters, I still get a UnicodeEncodeError like this:

print(label, msg)
... in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0] 
UnicodeEncodeError: 'charmap' codec can't encode character '\u0968' in position
38: character maps to

If I use the encode() method first (which does nicely default to UTF-8), I can avoid the error:

print(label, msg.encode())

This also works for printing objects or lists containing unicode strings--something I often have to do when debugging--since str() seems to default to UTF-8. But do I really need to remember to use print(str(myobj).encode()) every single time I want to do a print(myobj) ? If so, I suppose I could try to wrap it with my own function, but I'm not confident about handling all the argument permutations that print() supports.

Also, my script loads regular expressions from a file and applies them one by one. Before applying encode(), I was able to print something fairly legible to the console:

msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg)

Applying regex 5 of 15: ^\ge[0-9]*\b([ ]+[0-9]+\.)?[ ]*

However, this crashes if the regex includes literal unicode characters, so I applied encode() to the string first. But now the regexes are very hard to read on-screen (and I suspect I may have similar trouble if I try to write code that saves these regexes back to disk):

msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg.encode())

b'Applying regex 5 of 15: ^\\ge[0-9]*\b([ ]+[0-9]+\.)?[ ]*'

I'm not very experienced yet in Python, so I may be misunderstanding. Any explanations or links to tutorials (for Python 3.x; most of what I see online is for 2.x) would be much appreciated.

georg · Accepted Answer

print doesn't default to any encoding, it just uses whatever encoding the output device (like a console) claims to support. Your console encoding appears to be non-unicode, so print tries to encode your unicode strings in that encoding, and fails. The easiest way to get around this is to tell the console to use utf8 (like export LC_ALL=en_US.UTF-8 on unix systems).

Why doesn't print() in Python 3.2 seem to be defaulting to UTF-8?

Answers (2)

Related Questions

Why doesn&#39;t print() in Python 3.2 seem to be defaulting to UTF-8?

Answers (2)

Related Questions

Why doesn't print() in Python 3.2 seem to be defaulting to UTF-8?