Reputation: 28384
Through cStringIO on another system, I wrote some unicode via:
u'content-length'.encode('utf-8')
and on reading this back using unicode(stringio_fd.read(), 'utf-8'), I get:
u'c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00'
Printing the above in the terminal shows the right value, but of course I can't do anything useful with it, because it does not compare equal to the real string:
>>> print unicode("c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00")
content-length
>>> print unicode("c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00") == u'content-length'
False
What's the quickest, cheapest way to turn this string into one equal to u'content-length'? I can't change from cStringIO.
While philhag's answer is correct, it appears the problem is:
>>> StringIO.StringIO(u'content-type').getvalue().encode('utf-8')
'content-type'
>>> StringIO.StringIO(u'content-type').getvalue().encode('utf-8').decode('utf-8')
u'content-type'
>>> cStringIO.StringIO(u'content-type').getvalue().encode('utf-8').decode('utf-8')
u'c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00t\x00\x00\x00y\x00\x00\x00p\x00\x00\x00e\x00\x00\x00'
>>> cStringIO.StringIO(u'content-type').getvalue().encode('utf-8').decode('utf-8').decode('utf-32')
u'content-type'
Upvotes: 2
Views: 12014
Reputation: 82992
The root cause is that cStringIO.StringIO(unicode_object) produces nonsense.
The current 2.X docs on docs.python.org say
Unlike the StringIO module, this module is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.
This is unhelpful and incorrect; see below. The chm version of the docs supplied with the win32 installer for CPython 2.7.2 and 2.6.6 follows that with this sentence:
Calling StringIO() with a Unicode string parameter populates the object with the buffer representation of the Unicode string instead of encoding the string.
This is a correct description of the behaviour (see below). The behaviour is not brilliant. I can't imagine a good reason for that sentence being removed from the web docs.
Behaving badly:
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
>>> import StringIO, cStringIO, sys
>>> StringIO.StringIO(u"fubar").getvalue()
u'fubar' <<=== unicode object
>>> cStringIO.StringIO(u"fubar").getvalue()
'f\x00u\x00b\x00a\x00r\x00' <<=== str object
>>> cStringIO.StringIO(u"\u0405\u0406").getvalue()
'\x05\x04\x06\x04' <<=== "accepts"
>>> sys.maxunicode
65535 # your sender presumably emits 1114111 (wide unicode)
>>> sys.byteorder
'little'
So in general, all one needs to do is know or guess the endianness and unicode width of the sender's Python, then decode the mess with UTF-(16|32)-(B|L)E.
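As a rough sketch of that decoding step, the helper below maps a guessed code-unit width and byte order to the matching codec name. The function name decode_unicode_buffer and its parameters are made up for illustration; the byte strings are the ones from the session above plus the start of the value in the question.

```python
def decode_unicode_buffer(raw, width=4, byteorder="little"):
    """Decode bytes that are a raw dump of a unicode object's buffer.

    width:     bytes per code unit in the sender's Python (2 = narrow, 4 = wide)
    byteorder: the sender's sys.byteorder ("little" or "big")
    Both values are guesses that the caller has to supply.
    """
    codecs_by_layout = {
        (2, "little"): "utf-16-le",
        (2, "big"): "utf-16-be",
        (4, "little"): "utf-32-le",
        (4, "big"): "utf-32-be",
    }
    return raw.decode(codecs_by_layout[(width, byteorder)])


# Narrow, little-endian dumps like the ones in the session above
narrow = decode_unicode_buffer(b"f\x00u\x00b\x00a\x00r\x00", width=2)
cyrillic = decode_unicode_buffer(b"\x05\x04\x06\x04", width=2)

# Wide, little-endian dump like the start of the value in the question
wide = decode_unicode_buffer(b"c\x00\x00\x00o\x00\x00\x00", width=4)
```

A wrong guess usually shows up quickly: the wrong width or byte order tends to raise UnicodeDecodeError or yield obviously wrong characters.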
In your case the sender is being rather Byzantine; for example, u'content-length'.encode('utf-8') is the str object 'content-length', which bears a remarkable similarity to what you started with. Also, foo.encode('utf8').decode('utf8') produces either foo or an exception.
Upvotes: 4
Reputation: 288070
Something along the way is encoding your values as UTF-32. Simply decode them:
>>> b = u"c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00\
... n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00\
... n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00"
>>> b.decode('utf-32')
u'content-length'
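If the value is already a unicode object with embedded NULs, as in the question, one possible variation is to turn it back into bytes explicitly and name the byte order. Plain 'utf-32' falls back to the decoding machine's native byte order when there is no BOM. This is only a sketch, assuming every character in the garbled text is below U+0100 and the sender was little-endian; repair_utf32_text is a hypothetical helper name.

```python
def repair_utf32_text(garbled, codec="utf-32-le"):
    # latin-1 maps code points 0-255 one-to-one onto bytes 0-255,
    # so this recovers the original byte sequence unchanged.
    raw = garbled.encode("latin-1")
    # Assumption: the bytes are a UTF-32 dump in the given byte order.
    return raw.decode(codec)


garbled = u"c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00"
repaired = repair_utf32_text(garbled)
```

For a big-endian sender, passing codec="utf-32-be" would be the corresponding guess.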
Upvotes: 6