Reputation: 4295
I have a UTF-8 string piped from Java to Python. The end result is
'\xe0\xb8\x9a\xe0\xb8\x99'
Hence, for example:
a = '\xe0\xb8\x9a\xe0\xb8\x99'
a.decode('utf-8')
gives me the result
u'\u0e1a\u0e19'
However, what I am curious about is this: since the bytes are piped in as UTF-8, why is the result
'\xe0\xb8\x9a\xe0\xb8\x99'
instead of u'\u0e1a\u0e19'?
If I were to encode u'\u0e1a\u0e19', I would get back '\xe0\xb8\x9a\xe0\xb8\x99'.
So what is the inherent difference between these two, and how do I actually know when to use decode and when to use encode?
Upvotes: 0
Views: 3404
Reputation: 18898
"UTF-8 string" is an insufficient description of what '\xe0\xb8\x9a\xe0\xb8\x99' is; it really should be called the UTF-8 encoding of a unicode string.
Python 2's unicode type and Python 3's str type represent a string of unicode code points, so u'\u0e1a\u0e19' is the Python representation of the two code points U+0E1A U+0E19, which in human terms are rendered as บน.
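To make this concrete, here is a minimal Python 2 sketch (matching the syntax used in the question) showing that the unicode string is two code points long no matter how many bytes a particular encoding of it takes:

u = u'\u0e1a\u0e19'           # the two code points U+0E1A and U+0E19
print len(u)                  # 2 -- length counts code points, not bytes
print len(u.encode('utf-8'))  # 6 -- UTF-8 spends three bytes on each of these code points
# printing u on a terminal configured for a unicode locale renders it as Thai text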
As for the encode and decode calls, we will use your example. What you got back from Java is a stream of raw bytes, so to make it useful as human text you need to decode '\xe0\xb8\x9a\xe0\xb8\x99' as UTF-8-encoded input in order to recover the unicode code points those bytes represent (which is u'\u0e1a\u0e19'). Calling encode on that string of unicode code points turns it back into a sequence of bytes (which in Python 2 will be the str type and in Python 3 will actually be the bytes type), getting you back to the series of bytes that is '\xe0\xb8\x9a\xe0\xb8\x99'.
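A minimal Python 2 sketch of that round trip, using the literals from the question:

raw = '\xe0\xb8\x9a\xe0\xb8\x99'  # the raw bytes received from Java
text = raw.decode('utf-8')        # -> u'\u0e1a\u0e19', a string of unicode code points
back = text.encode('utf-8')       # -> '\xe0\xb8\x9a\xe0\xb8\x99', bytes again
assert back == raw                # decode and encode are inverses for a given codec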
Of course, you can encode those unicode code points into other encodings, such as UTF-16, which on little-endian platforms will result in the bytes '\xff\xfe\x1a\x0e\x19\x0e', or encode those code points into a non-unicode encoding. As this looks like Thai, we can use the iso8859-11 encoding, which encodes to the bytes '\xba\xb9' - but this is not cross-platform, as it will only be shown as Thai on systems configured for that particular encoding. This ambiguity is one of the reasons Unicode was invented: the bytes '\xba\xb9' could be decoded using the iso8859-1 encoding, which would render as º¹, or using iso8859-11, which would render as บน.
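Here is the same pair of code points going out through three encodings, and the two iso8859-11 bytes coming back differently depending on which codec the reader assumes (a Python 2 sketch; the UTF-16 output includes a byte order mark and is shown for a little-endian platform):

u = u'\u0e1a\u0e19'
print repr(u.encode('utf-8'))       # '\xe0\xb8\x9a\xe0\xb8\x99'
print repr(u.encode('utf-16'))      # '\xff\xfe\x1a\x0e\x19\x0e'
print repr(u.encode('iso8859-11'))  # '\xba\xb9'

# the same two bytes mean different things under different codecs:
print repr('\xba\xb9'.decode('iso8859-11'))  # u'\u0e1a\u0e19' -- renders as Thai
print repr('\xba\xb9'.decode('iso8859-1'))   # u'\xba\xb9' -- renders as two Latin-1 symbols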
In short, '\xe0\xb8\x9a\xe0\xb8\x99' is the UTF-8 encoding of the unicode code points written in Python syntax as u'\u0e1a\u0e19'. Raw bytes (coming through the wire, read from a file) are generally not in the form of unicode code points, and they must be decoded into unicode code points. Unicode code points are not an encoding; when sent across the wire (or written to a file) they must be encoded into some kind of byte representation, which in many cases is UTF-8, as it has the greatest portability.
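The rule of thumb this implies: decode at the input boundary, work with unicode inside the program, and encode at the output boundary. A hypothetical Python 2 sketch of that pattern for a pipe like yours, assuming the Java side writes UTF-8:

import sys

raw = sys.stdin.read()      # bytes arrive as a Python 2 str
text = raw.decode('utf-8')  # assumption: the sender wrote UTF-8

# ... operate on text as unicode code points here ...

sys.stdout.write(text.encode('utf-8'))  # encode again before the bytes leave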
Lastly, you should read this: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Upvotes: 2
Reputation: 44376
'\xe0\xb8\x9a\xe0\xb8\x99' is simply a series of bytes. You have chosen to interpret it as UTF-8, and when you do, you can decode it into a series of unicode characters, U+0E1A and U+0E19.
The sequence U+0E1A, U+0E19 can be represented as u'\u0e1a\u0e19', but in some sense that representation is just as arbitrary as '\xe0\xb8\x9a\xe0\xb8\x99'. It is "natural", which is why Python prints them that way, but it is inefficient, which is why there are various other encoding schemes, including UTF-8.
In fact, it's slightly misleading for me to say "'\xe0\xb8\x9a\xe0\xb8\x99' is a series of bytes." It is the default representation of a series of bytes: two hundred twenty-four, followed by one hundred eighty-four, and so on.
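You can see those numbers directly; a quick Python 2 sketch:

a = '\xe0\xb8\x9a\xe0\xb8\x99'
print [ord(c) for c in a]  # [224, 184, 154, 224, 184, 153] -- just numbers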
Python has a notion of a series of bytes, and it has a separate notion of a series of unicode characters. encode and decode represent one way of mapping between those two notions.
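In Python 2 terms those notions are the str and unicode types, and decode/encode move you between them; a minimal sketch:

a = '\xe0\xb8\x9a\xe0\xb8\x99'
u = a.decode('utf-8')          # bytes -> unicode characters
print type(a), type(u)         # <type 'str'> <type 'unicode'>
assert u.encode('utf-8') == a  # unicode characters -> bytes, the inverse mapping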
Does that help?
Upvotes: 2