Reputation: 5413
I'm trying to understand how encoding works in Python, and I think I've nearly got it. Here is some code, which I will explain; I would like you to verify my reasoning :)
text = line.decode( encoding )
print "type(text) = %s" % type(text)
iso_8859_1 = text.encode('latin1')
print "type(iso_8859_1) = %s" % type(iso_8859_1)
unicodeStr = text.encode('utf-8')
print "type(unicodeStr) = %s" % type(unicodeStr)
So the first line
text = line.decode( encoding )
transforms a string given in the encoding "encoding" into Python's unicode text format. Therefore the output is
type(text) = <type 'unicode'>
So now I am using the original text from my file in a UTF-8 encoding style, and for the rest of my code "text" is UTF-8 text.
Now I want to transform (for whatever reason) the UTF-8 text into something else, e.g. latin1, which is done by "text.encode('latin1')". The output of my code in that case is
type(iso_8859_1) = <type 'str'>
type(unicodeStr) = <type 'str'>
Now, the only question that remains for me: why is the type in the two latter cases 'str' and not 'latin1' or 'unicode'? That's what's still unclear to me.
Are the latter strings "iso_8859_1" and "unicodeStr" not encoded in "latin1" and "unicode" respectively?
Upvotes: 0
Views: 107
Reputation: 5395
First, utf8 != unicode.
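In Python 2, str is basically a sequence of bytes, an encoding is a method of interpreting such a sequence as text, and unicode is, well, unicode: text with no encoding attached. So encode() always gives you a str, whichever encoding you pass; the encoding decides which bytes it contains, not its type.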
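To make that concrete, here is a minimal sketch. The sample word and the variable names are made up for illustration and are not from the question. It is written so that it should run on Python 2 as well as Python 3, where the byte type is called bytes instead of str.

```python
# Hypothetical sample text: "cafe" with an accented e (U+00E9).
# In Python 2 this is a unicode object; in Python 3 it is a str.
text = u"caf\u00e9"

# Encoding turns text into bytes. The encoding picks the bytes, not the type.
latin1_bytes = text.encode("latin1")
utf8_bytes = text.encode("utf-8")

# Both results have the same type (str in Python 2, bytes in Python 3).
print(type(latin1_bytes) is type(utf8_bytes))  # True

# The byte content differs: latin1 uses 1 byte for the accented e, UTF-8 uses 2.
print(len(latin1_bytes))  # 4
print(len(utf8_bytes))    # 5

# Decoding with the matching encoding gives back the original text.
print(latin1_bytes.decode("latin1") == text)  # True
print(utf8_bytes.decode("utf-8") == text)     # True

# Decoding with the wrong encoding silently produces different text,
# because the bytes themselves do not record how they were encoded.
wrong = utf8_bytes.decode("latin1")
print(wrong == text)  # False
```

The byte strings carry no label saying which encoding produced them, so it is up to the program to remember that and pass the right name to decode().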
Joel has a great post on this subject: http://www.joelonsoftware.com/articles/Unicode.html
Upvotes: 1