Jon Coombs
Jon Coombs

Reputation: 2226

Why doesn't print() in Python 3.2 seem to be defaulting to UTF-8?

I'm writing scripts to clean up unicode text files (stored as UTF-8), and I chose to use Python 3.x (3.2) rather than the more popular 2.x because 3.x is supposed to default to UTF-8. Maybe I'm doing something wrong, but it seems that the print statement, at least, still is not defaulting to UTF-8. If I try to print a string (msg below is a string) that contains special characters, I still get a UnicodeEncodeError like this:

print(label, msg)
... in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0] 
UnicodeEncodeError: 'charmap' codec can't encode character '\u0968' in position
38: character maps to <undefined>

If I use the encode() method first (which does nicely default to UTF-8), I can avoid the error:

print(label, msg.encode())

This also works for printing objects or lists containing unicode strings--something I often have to do when debugging--since str() seems to default to UTF-8. But do I really need to remember to use print(str(myobj).encode()) every single time I want to do a print(myobj) ? If so, I suppose I could try to wrap it with my own function, but I'm not confident about handling all the argument permutations that print() supports.

Also, my script loads regular expressions from a file and applies them one by one. Before applying encode(), I was able to print something fairly legible to the console:

msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg)

Applying regex 5 of 15: ^\\ge[0-9]*\b([ ]+[0-9]+\.)?[ ]*

However, this crashes if the regex includes literal unicode characters, so I applied encode() to the string first. But now the regexes are very hard to read on-screen (and I suspect I may have similar trouble if I try to write code that saves these regexes back to disk):

msg = 'Applying regex {} of {}: {}'.format(i, len(regexes), regex._findstr)
print(msg.encode())

b'Applying regex 5 of 15: ^\\\\ge[0-9]*\\b([ ]+[0-9]+\\.)?[ ]*'

I'm not very experienced yet in Python, so I may be misunderstanding. Any explanations or links to tutorials (for Python 3.x; most of what I see online is for 2.x) would be much appreciated.

Upvotes: 2

Views: 655

Answers (2)

georg
georg

Reputation: 214949

print doesn't default to any encoding, it just uses whatever encoding the output device (like a console) claims to support. Your console encoding appears to be non-unicode, so print tries to encode your unicode strings in that encoding, and fails. The easiest way to get around this is to tell the console to use utf8 (like export LC_ALL=en_US.UTF-8 on unix systems).

Upvotes: 6

Thomas Orozco
Thomas Orozco

Reputation: 55207

The easier way to proceed is to only use unicode in your script, and only use encoded data when you want to interact with the "outside" world. That is, when you have input to decode or output to encode.

To do so, everytime you read something, use decode, everytime you output something, use encode.

For you regex, use the re.UNICODE flag.

I know that this doesn't exactly answer your question point by point, but I think that applying such a methodology should keep you safe from encoding issues.

Upvotes: 2

Related Questions