Reputation: 2723
I've got a list of strings, along the lines of list=[a,b,c,d,e]. When I call list[2], the string c is displayed as ASCII; when I call print list[2], however, it's displayed as Unicode. Why does this discrepancy exist?
Upvotes: 2
Views: 105
Reputation: 6855
This is mainly because strings in Python 2 are not text strings but byte strings.
I suppose you are in a REPL environment (a Python console). When you evaluate an expression in the console, it displays the expression's repr(), which is the same as calling print repr() on the expression:
l = ['ñ']
l[0] # should output '\xc3\xb1'
print repr(l[0]) # should output the same
This is because your console is in UTF-8 mode (if you get a different representation for ñ, it is because your console uses some other encoding), so when you type ñ you are actually entering two bytes, 0xc3 and 0xb1.
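As a rough illustration of why the bytes depend on the console's encoding, here is a small sketch. It is written in Python 3 syntax (an assumption, since the snippets in this answer are Python 2), and the variable names are made up for the example. It encodes the same character with two encodings and lists the resulting bytes:

```python
# Python 3 syntax; '\u00f1' is the character ñ written as an escape.
char = '\u00f1'

utf8_bytes = char.encode('utf-8')      # what a UTF-8 console would send
latin1_bytes = char.encode('latin-1')  # what a Latin-1 console would send

# Iterating over a bytes object in Python 3 yields integers.
print([hex(b) for b in utf8_bytes])    # ['0xc3', '0xb1']
print([hex(b) for b in latin1_bytes])  # ['0xf1']
```

So a console using Latin-1 would show '\xf1' instead of '\xc3\xb1' for the same key press.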
repr() is a built-in Python function that always returns a string. For built-in types, this string is usually valid source code that rebuilds the value passed as the argument. In this case it returns a string of escape sequences that recreates the byte string holding ñ encoded as UTF-8. To see this:
repr(l[0]) # should print a string within a string: "'\\xc3\\xb1'"
So when you print it (which is what the console does when you just evaluate l[0]), you get the same string without the outer quotes and with the doubled backslashes resolved. That is:
print repr(l[0]) # should output '\xc3\xb1'
But when you print the value itself, i.e. print l[0], you send those two bytes to the console. As the console is in UTF-8 mode, it decodes the sequence and translates it into a single character, ñ. So:
print l[0] # should output ñ
If you want to store text strings, you must prefix the string literal with u, like this:
text = u'ñ'
Now, when evaluating text, you will see its Unicode code point:
text # should output u'\xf1'
And printing it should recreate the ñ glyph:
print text # should output `ñ`
If you want to convert text into a byte string representation, you need an encoding scheme (such as UTF-8):
text.encode('utf-8') == l[0] # should output True
Similarly, if you want the Unicode representation of l[0], you'll need to decode those bytes:
l[0].decode('utf-8') == text # should output True
All this said, note that in Python 3 default strings are indeed Unicode strings, and you need to prefix the literal with b to produce byte strings.
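For comparison, here is a minimal Python 3 sketch of the same idea. The variable names mirror the ones used above but are otherwise just for illustration, and ascii() is used instead of printing ñ directly so the output does not depend on the console encoding:

```python
# Python 3: an unprefixed literal is Unicode text, a b-prefixed literal is bytes.
text = '\u00f1'      # str holding the single character ñ
raw = b'\xc3\xb1'    # bytes holding the UTF-8 encoding of ñ

print(type(text).__name__)   # str
print(type(raw).__name__)    # bytes
print(len(text), len(raw))   # 1 2

print(ascii(text))           # '\xf1'
print(repr(raw))             # b'\xc3\xb1'

# Converting between the two still needs an explicit encoding.
print(text.encode('utf-8') == raw)   # True
print(raw.decode('utf-8') == text)   # True
```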
Upvotes: 3
Reputation: 76244
It's because those two ways of displaying a string take different routes to the final result. Evaluating x by itself in the REPL will invoke repr(x) and display that, but print(x) will invoke str(x) and display that instead. Classes are allowed to define __repr__ and __str__ separately, so they don't always return the same value.
>>> x = u"a"
>>> x
u'a'
>>> print x
a
>>> repr(x)
"u'a'"
>>> str(x)
'a'
>>>
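To make the point about classes concrete, here is a small sketch with a made-up class (Temperature is hypothetical, not something from the question) that defines the two methods separately:

```python
class Temperature(object):
    """Hypothetical class used only to show __repr__ and __str__ diverging."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __repr__(self):
        # Used when the object is evaluated in the REPL or passed to repr().
        return 'Temperature(%r)' % (self.degrees,)

    def __str__(self):
        # Used by print and by str().
        return '%s degrees' % (self.degrees,)


t = Temperature(21)
print(repr(t))   # Temperature(21)
print(str(t))    # 21 degrees
```

Typing t at the prompt would show the first form, while print would show the second.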
Upvotes: 2