Reputation: 348
I am using Python 3.3.
What is the difference between a unicode string and a binary string?
b'\\u4f60'
u'\x4f\x60'
b'\x4f\x60'
u'4f60'
The concept of Unicode and binary strings is confusing. How can I change b'\\u4f60' into b'\x4f\x60'?
Upvotes: 4
Views: 3745
Reputation: 25954
First - there is no difference between unicode literals and string literals in Python 3. They are one and the same - you can drop the u up front. Just write strings. So you should see immediately that the literal u'4f60' is just like writing '4f60'.
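A quick way to check this yourself (a minimal sketch; the variable names are just for illustration):

```python
# The u prefix is accepted from Python 3.3 on, but it changes nothing.
with_prefix = u'4f60'
without_prefix = '4f60'

print(with_prefix == without_prefix)               # True
print(type(with_prefix) is type(without_prefix))   # True, both are str
print(len(with_prefix))                            # 4 ordinary characters: '4', 'f', '6', '0'
```

Note that neither literal contains an escape sequence, so neither has anything to do with the code point U+4F60.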
A bytes literal - aka b'some literal' - is a series of bytes. Bytes between 32 and 126 (the printable ASCII range) are displayed as their corresponding glyph, the rest are displayed in escaped form, typically as \x followed by two hex digits. Don't be confused by this - b'\x61' is the same as b'a'. It's just a matter of printing.
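To see that the escape is only a display matter, here is a small sketch (the names are illustrative only):

```python
# Two spellings of the same one-byte value compare equal.
same = b'\x61' == b'a'

# repr() shows printable ASCII bytes as glyphs and everything else escaped.
shown = repr(b'\x4f\x60\xff')

# Iterating over bytes yields plain integers, which is what is really stored.
values = list(b'O`')

print(same)    # True
print(shown)   # b'O`\xff'
print(values)  # [79, 96]
```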
A string literal is a string literal. It can contain Unicode code points. There is far too much to cover to explain how Unicode works here, but basically a code point identifies an abstract character (roughly, a letter, digit or symbol); it does not specify how the machine needs to represent it. In fact there are a great many different ways to do that.
Thus there is a very large difference between bytes literals and str literals. The former describe the machine representation, the latter describe the alphanumeric glyphs that we are reading right now. The mapping between the two domains is encoding/decoding.
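As a rough illustration of that mapping (a sketch only; the two encodings are picked as examples, not as a recommendation):

```python
text = '\u4f60'   # a str holding one code point, U+4F60

# Encoding goes from str to bytes. The bytes depend on the encoding chosen.
as_utf8 = text.encode('utf-8')
as_utf16be = text.encode('utf-16-be')

# Decoding goes from bytes back to str, using the encoding that produced them.
round_trip = as_utf8.decode('utf-8')

print(as_utf8)             # b'\xe4\xbd\xa0'
print(as_utf16be)          # b'O`'
print(round_trip == text)  # True
```

Same character, two different byte sequences: the str says which character it is, the bytes say how one particular encoding stores it.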
I'm skipping over a lot of vital information here. That should get us somewhere though. I highly recommend reading more since this is not an easy topic.
How can I change b'\\u4f60' into b'\x4f\x60'?
Let's walk through it:
b'\u4f60'
Out[101]: b'\\u4f60'  # note: \u is not an escape in a bytes literal, so the backslash is kept as a literal byte
b'\x4f\x60'
Out[102]: b'O`'
'\u4f60'
Out[103]: '你'
So, notice that \u4f60 is that Han ideograph glyph. \x4f\x60 is, if we represent it in ASCII (or UTF-8, actually), the letter O (\x4f) followed by a backtick (\x60).
I can ask Python to turn that unicode-escaped bytes sequence into a valid string with the corresponding Unicode glyph:
b'\\u4f60'.decode('unicode-escape')
Out[112]: '你'
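To make that step concrete, a small sketch of what goes in and what comes out (variable names are illustrative):

```python
raw = b'\\u4f60'                         # six bytes: backslash, u, 4, f, 6, 0
decoded = raw.decode('unicode-escape')   # interprets the escape, giving a str

print(len(raw))            # 6
print(len(decoded))        # 1
print(hex(ord(decoded)))   # 0x4f60
```

So the six ASCII bytes were only a textual description of the code point; after decoding we hold the character itself.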
So now all we need to do is to re-encode to bytes, right? Well...
Coming around to what I think you're wanting to ask -
How can I change '\\u4f60' into its proper bytes representation?
There is no 'proper' bytes representation of that Unicode code point. There is only a representation in the encoding that you want. It so happens that there is one encoding that directly matches the transformation to b'\x4f\x60' - utf-16be (big-endian UTF-16).
b'\\u4f60'.decode('unicode-escape').encode('utf-16-be')
Out[47]: b'O`'
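The reason this works is that utf-16 is a variable-length encoding. For code points that fit in 16 bits (below U+10000) it just directly uses the code point value as the 2-byte encoding, and for code points above that it uses something called "surrogate pairs", which I won't get into. The be variant writes the high byte first, which is why the bytes come out as \x4f then \x60.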
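A short sketch of both cases, assuming U+1F600 as an arbitrary example of a code point above 16 bits (it is not from the question):

```python
bmp = '\u4f60'          # fits in 16 bits
astral = '\U0001F600'   # does not fit in 16 bits (example chosen for illustration)

bmp_bytes = bmp.encode('utf-16-be')
astral_bytes = astral.encode('utf-16-be')

# For the 16-bit case, the two bytes are the code point itself, high byte first.
bmp_value = int.from_bytes(bmp_bytes, 'big')

print(bmp_bytes)           # b'O`'
print(hex(bmp_value))      # 0x4f60
print(len(astral_bytes))   # 4, i.e. two 16-bit units forming a surrogate pair
```

This is why the trick only yields "the code point as bytes" for characters below U+10000; above that, the output is the surrogate pair rather than the code point value.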
Upvotes: 6