Reputation: 773
Assume I read some content from a socket in Python and have to decode it as UTF-8 on the fly.
I cannot afford to keep all the content in memory, so I must decode it as I receive it and save it to a file.
It can happen that I receive only some of the bytes of a character (the € sign, for example, is represented by three bytes, in Python '\xe2\x82\xac').
Assume I have received only the first two bytes ('\xe2\x82'); if I try to decode them, I get a UnicodeDecodeError, as expected.
I could always try to decode the current content and check whether it throws an exception.
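To make the failure concrete, here is a minimal reproduction. It is written in Python 3 syntax as an assumption (the byte values are the same in Python 2), and the variable names are only illustrative:

```python
# The euro sign encodes to three bytes in UTF-8.
euro = "€".encode("utf-8")  # b'\xe2\x82\xac'

# Simulate having received only the first two bytes.
partial = euro[:2]

try:
    partial.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # A one-shot decode treats the truncated sequence as invalid input.
    error = exc

print(repr(error))
```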
Thanks
Upvotes: 3
Views: 419
Reputation: 18438
How about using a combination of functools.partial and codecs.iterdecode (as shown here)?
I created a file full of € symbols, and it seems to work as expected (although instead of reading from a file, as shown below, you would be reading from your socket):
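#!/usr/bin/env python
import codecs
import functools
import sys

with open('stack70.txt', 'rb') as euro_file:
    # Read one byte at a time; iteration stops when read() returns '' at EOF.
    f_iterator = iter(functools.partial(euro_file.read, 1), '')
    # iterdecode yields text only once a complete character has been received.
    for item in codecs.iterdecode(f_iterator, 'utf-8'):
        print "sizeof item: %s, item: %s" % (sys.getsizeof(item), item)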
DISCLAIMER: I have little experience with codecs, so I'm not 100% sure this will do what you want, but (as far as I can tell), it does, right?
stack70.txt is the file full of "euro" symbols. The code above outputs:
sizeof item: 56, item: €
sizeof item: 56, item: €
sizeof item: 56, item: €
sizeof item: 56, item: €
sizeof item: 56, item: €
(Done using Python 2.7.)
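For anyone on Python 3, a rough equivalent might look like the sketch below. Two things here are assumptions rather than part of the code above: io.BytesIO stands in for the file or socket, and the iter() sentinel becomes b"" because binary reads return bytes in Python 3.

```python
import codecs
import functools
import io

# Stand-in for the file or socket: five euro signs, three bytes each.
source = io.BytesIO("€€€€€".encode("utf-8"))

# Read one byte at a time; in Python 3 the EOF sentinel is b"", not "".
byte_iterator = iter(functools.partial(source.read, 1), b"")

# iterdecode buffers incomplete sequences and yields only complete characters.
decoded = list(codecs.iterdecode(byte_iterator, "utf-8"))
print(decoded)
```

Even though the input arrives one byte at a time, each yielded item is a whole character.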
Upvotes: 1
Reputation: 799390
Guido's time machine strikes again.
>>> dec = codecs.getincrementaldecoder('utf-8')()
>>> dec.decode('foo\xe2\x82')
u'foo'
>>> dec.decode('\xac')
u'\u20ac'
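A minimal sketch of how this could be wired into a receive loop, written for Python 3 (the session above is Python 2). The helper name decode_chunks_to_file and the hard-coded chunks are assumptions for illustration; with a real socket the chunks would come from repeated sock.recv() calls, and the output would be a file opened in text mode.

```python
import codecs
import io


def decode_chunks_to_file(chunks, out_file, encoding="utf-8"):
    """Incrementally decode an iterable of byte chunks and write the text to out_file."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        # Incomplete trailing bytes are buffered inside the decoder, not raised.
        out_file.write(decoder.decode(chunk))
    # final=True raises UnicodeDecodeError if the stream ended mid-character.
    out_file.write(decoder.decode(b"", final=True))


# Hypothetical chunks as they might arrive from the socket: the euro sign is split.
received = [b"foo\xe2\x82", b"\xac bar"]

out = io.StringIO()  # stand-in for open("out.txt", "w", encoding="utf-8")
decode_chunks_to_file(received, out)
result = out.getvalue()
print(result)
```

Only one chunk is held in memory at a time, plus at most a few buffered bytes of an unfinished character inside the decoder.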
Upvotes: 6