In Python, why does calling a string, "X", display it in ASCII, but calling "print X" display it in unicode?

Question

I've got a list of strings, along the lines of list=[a,b,c,d,e].

When I call list[2], the string c is displayed as ASCII; when I call print list[2], however, it's displayed as unicode. Why does this discrepancy exist?

Salva · Accepted Answer

This is mainly because strings in Python 2 are not text strings but byte strings.

I suppose you are in a REPL environment (a Python console). When you evaluate an expression in the console, it echoes the value's representation, which is the same as calling print repr() on the expression:

l = ['ñ']
l[0] # should output '\xc3\xb1'
print repr(l[0]) # should output the same

This is because your console is in UTF-8 mode (if you get a different representation for ñ, it is because your console uses some other text encoding), so when you type ñ you are actually entering two bytes: 0xc3 and 0xb1.

repr() is a built-in Python function that always returns a string. For primitive types, this string is valid source code that rebuilds the value passed as the parameter. In this case it returns a string containing the escape sequences that recreate the byte string holding ñ encoded as UTF-8. To see this:

repr(l[0]) # should show a string within a string: "'\\xc3\\xb1'"

So when you print that string (which gives the same output as just evaluating l[0] in the console), you get its contents without the outer quotes, with the escape sequences \xc3 and \xb1 shown as literal text. I.e.:

print repr(l[0]) # should output '\xc3\xb1'

But when you print the value itself, i.e. print l[0], you send those two bytes to the console. As the console is in UTF-8 mode, it decodes the sequence and translates it into a single character: ñ. So:

print l[0] # should output ñ
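The snippets in this answer are Python 2. As a rough sketch for readers on Python 3, the same contrast between the console echo and the printed bytes can be reproduced with an explicit bytes object. The names raw, echoed and rendered are illustrative only, and a UTF-8 console is assumed, as above:

```python
# Python 3 sketch: a bytes object stands in for the Python 2 str.
raw = b'\xc3\xb1'               # the two UTF-8 bytes for U+00F1 (n with tilde)

# Evaluating `raw` at the console echoes repr(raw): escapes, not the glyph.
echoed = repr(raw)

# A UTF-8 console that receives the raw bytes decodes them into one character.
rendered = raw.decode('utf-8')

print(echoed)                   # b'\xc3\xb1'
print(len(raw), len(rendered))  # 2 1
print(ascii(rendered))          # '\xf1'
```

The last line uses ascii() so that the output stays printable even on a console that cannot show ñ.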

If you want to store text strings, you must prefix the string literal with u, like this:

text = u'ñ'

Now, when evaluating text, you will see its Unicode code point:

text # should output u'\xf1'

And printing it should recreate the ñ glyph:

print text # should output ñ

If you want to convert text into a byte string representation, you need an encoding scheme (such as UTF-8):

text.encode('utf-8') == l[0] # should output True

Similarly, if you want the Unicode representation for l[0], you'll need to decode those bytes:

l[0].decode('utf-8') == text # should output True
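The round trip above depends on using the same encoding in both directions. A minimal sketch, written for Python 3 and using illustrative variable names, shows how the bytes differ between UTF-8 and Latin-1 and what happens when the wrong codec is used to decode:

```python
text = '\u00f1'                         # n with tilde as a text (Unicode) string

utf8_bytes = text.encode('utf-8')       # two bytes
latin1_bytes = text.encode('latin-1')   # one byte

print(utf8_bytes)                       # b'\xc3\xb1'
print(latin1_bytes)                     # b'\xf1'

# Decoding with the codec that produced the bytes restores the text.
print(utf8_bytes.decode('utf-8') == text)      # True
print(latin1_bytes.decode('latin-1') == text)  # True

# Decoding UTF-8 bytes as Latin-1 does not raise; it yields two wrong characters.
wrong = utf8_bytes.decode('latin-1')
print(ascii(wrong))                     # '\xc3\xb1'
```

This is also why a console that is not in UTF-8 mode would show a different byte sequence for the same character.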

All this said, note that in Python 3, default strings are indeed Unicode strings, and you need to prefix the literal with b to produce byte strings.
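To illustrate the Python 3 behaviour, here is a short sketch with illustrative variable names. It only shows how the two literal forms differ in type and length, not a complete account of the Python 3 string model:

```python
text = '\u00f1'       # unprefixed literal: a str holding one code point
data = b'\xc3\xb1'    # b-prefixed literal: a bytes object holding two bytes

print(type(text).__name__, len(text))   # str 1
print(type(data).__name__, len(data))   # bytes 2

# Indexing bytes yields integers, not one-character strings.
print(data[0])                          # 195

# The two types convert only through an explicit encoding.
print(text.encode('utf-8') == data)     # True
print(data.decode('utf-8') == text)     # True

# In Python 3, str has no decode() and bytes has no encode().
print(hasattr(text, 'decode'), hasattr(data, 'encode'))  # False False
```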
