Python "\x00" filled / utf-32 string from cStringIO

Question

Through cStringIO of another system, I wrote some unicode via:

u'content-length'.encode('utf-8')

and on reading this back using unicode(stringio_fd.read(), 'utf-8'), I get:

u'c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00'

Printing the above in the terminal shows the right value, but of course I can't do anything useful with it:

print unicode("c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00")

content-length

print unicode("c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00t\x00\x00\x00h\x00\x00\x00") == u'content-length'

False

What's the quickest, cheapest way to turn this string into a string equivalent to u'content-length'? I can't change from cStringIO.

Updates

While philhag's answer is correct, it appears the problem is:

StringIO.StringIO(u'content-type').getvalue().encode('utf-8')

'content-type'

StringIO.StringIO(u'content-type').getvalue().encode('utf-8').decode('utf-8')

u'content-type'

cStringIO.StringIO(u'content-type').getvalue().encode('utf-8').decode('utf-8')

u'c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00t\x00\x00\x00y\x00\x00\x00p\x00\x00\x00e\x00\x00\x00'

cStringIO.StringIO(u'content-type').getvalue().encode('utf-8').decode('utf-8').decode('utf-32')

u'content-type'

John Machin · Accepted Answer

The root cause is that cStringIO.StringIO(unicode_object) produces nonsense.

The current 2.X docs on docs.python.org say

Unlike the StringIO module, this module is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.

This is unhelpful and incorrect; see below. The chm version of the docs supplied with the win32 installer for CPython 2.7.2 and 2.6.6 follows that with this sentence:

Calling StringIO() with a Unicode string parameter populates the object with the buffer representation of the Unicode string instead of encoding the string.

This is a correct description of the behaviour (see below). The behaviour is not brilliant. I can't imagine a good reason for that sentence being removed from the web docs.

Behaving badly:

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
>>> import StringIO, cStringIO, sys
>>> StringIO.StringIO(u"fubar").getvalue()
u'fubar' <<=== unicode object
>>> cStringIO.StringIO(u"fubar").getvalue()
'f\x00u\x00b\x00a\x00r\x00' <<=== str object
>>> cStringIO.StringIO(u"\u0405\u0406").getvalue()
'\x05\x04\x06\x04' <<=== "accepts"
>>> sys.maxunicode
65535 # your sender presumably emits 1114111 (wide unicode)
>>> sys.byteorder
'little'
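
The byte strings in the session above are the same bytes the UTF-16-LE codec produces for those inputs, which fits the "buffer representation" description for a narrow, little-endian build. Here is a minimal sketch of that comparison. It deliberately avoids cStringIO and uses only the standard codecs, so it should also run on Python 3; the variable names are made up for illustration.

```python
# Sketch: reproduce the bytes seen above with an explicit codec instead of
# relying on cStringIO's buffer-copying behaviour.
fubar_buffer = u"fubar".encode("utf-16-le")
cyrillic_buffer = u"\u0405\u0406".encode("utf-16-le")

print(repr(fubar_buffer))
print(repr(cyrillic_buffer))

# A wide build stores four bytes per code point rather than two, which
# corresponds to UTF-32 in the machine's byte order.
wide_buffer = u"fubar".encode("utf-32-le")
print(len(fubar_buffer))
print(len(wide_buffer))
```

The four-bytes-per-character layout is what the question's data looks like, so the sender there was presumably a wide, little-endian build.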

So in general all one needs to do is know/guess the endianness and unicode-width of the sender's Python and decode the mess with UTF-(16|32)-(B|L)E.
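
Picking the codec from the sender's width and byte order could be sketched as follows. The helper name decode_unicode_buffer and its parameters are hypothetical, and both values have to describe the sender's Python rather than the receiver's, so they are passed in explicitly instead of being read from the local sys module.

```python
def decode_unicode_buffer(raw, wide, byteorder):
    """Decode a raw unicode buffer dumped by cStringIO on the sender.

    wide      -- True if the sender's sys.maxunicode was 1114111
    byteorder -- the sender's sys.byteorder, 'little' or 'big'
    """
    bits = 32 if wide else 16
    suffix = "le" if byteorder == "little" else "be"
    return raw.decode("utf-%d-%s" % (bits, suffix))


# First three characters of the buffer from the question.
raw = b"c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00"
text = decode_unicode_buffer(raw, wide=True, byteorder="little")
print(text)
```

A wrong guess usually shows up quickly: either the decode raises an error, or the result is obviously not the expected text.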

In your case the sender is being rather Byzantine; for example u'content-length'.encode('utf-8') is the str object 'content-length' which bears a remarkable similarity to what you started with. Also foo.encode('utf8').decode('utf8') produces either foo or an exception.
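
If the buffer has already been decoded as UTF-8 on the receiving side, as in the question, the damage can be undone for this particular value. This is a sketch that assumes the sender was a wide, little-endian build and that the text is pure ASCII; with non-ASCII text the UTF-8 decode step would likely fail or mangle the data, so decoding the raw bytes directly is the safer route.

```python
# The NUL-padded unicode object from the question.
mangled = (u"c\x00\x00\x00o\x00\x00\x00n\x00\x00\x00t\x00\x00\x00"
           u"e\x00\x00\x00n\x00\x00\x00t\x00\x00\x00-\x00\x00\x00"
           u"l\x00\x00\x00e\x00\x00\x00n\x00\x00\x00g\x00\x00\x00"
           u"t\x00\x00\x00h\x00\x00\x00")

# Every code point is below 0x80, so encoding as UTF-8 gives back exactly
# the bytes that were read from the stream.
raw = mangled.encode("utf-8")

# Assumption: wide, little-endian sender, hence UTF-32-LE.
fixed = raw.decode("utf-32-le")
print(fixed == u"content-length")
```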
