evolution

Reputation: 593

Python latin-1 UnicodeDecodeError after switching to Ubuntu 14 with CouchDB cPickle binary data

For some strange reason my Python code stopped working after I switched from Ubuntu 12 to Ubuntu 14: I can no longer unpickle my data. I stored the pickled data in a CouchDB database after converting it to text with the latin1 encoding.

I'm using latin1 because I read some time ago (I don't have the link any more) that it is the only encoding I can use to store and retrieve cPickled binary data from a CouchDB database. It was meant to avoid encoding issues with JSON (couchdbkit uses JSON in the background).

Latin1 is supposed to map each of the 256 byte values to exactly one of 256 characters, so the conversion is byte for byte. Now, after the system upgrade, Python complains as if only 128 values were valid and throws a UnicodeDecodeError (see below).
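The byte-for-byte claim can be checked directly. The following is a minimal sketch, written to run under both Python 2 and Python 3 (so it uses `pickle` rather than `cPickle`); the sample `value` is made up for illustration.

```python
import json
import pickle

# Every possible byte value, 0x00 through 0xFF.
raw = bytes(bytearray(range(256)))

# latin1 maps byte N to code point N, so decoding never fails and
# encoding the result gives back the original bytes.
text = raw.decode('latin1')
assert len(text) == 256
assert text.encode('latin1') == raw

# The same property lets pickled bytes survive a trip through JSON.
value = {'name': 'example', 'numbers': [1, 2, 3]}  # hypothetical sample data
pickled = pickle.dumps(value, protocol=2)
as_json = json.dumps(pickled.decode('latin1'))
restored = pickle.loads(json.loads(as_json).encode('latin1'))
assert restored == value
```

The round trip only holds as long as the text stays text: if some layer re-encodes it (for example as UTF-8) before `.encode('latin1')` is called, the bytes no longer match.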

Not sure you need all those details, but here are some declarations I use:

#deals with all the errors when saving an item
def saveitem(item):  
    item.set_db(self.db)
    item["_id"] = key  
    error = True
    while error:
        try:    
            item.save()
            error = False
        except ResourceConflict:
            try:
                item = DBEntry.get_or_create(key)
            except ResourceConflict:
                pass
        except (NoMoreData) as e:
            print "CouchDB.set.saveitem: NoMoreData error, retrying...", str(e)
        except (RequestError) as e:
            print "CouchDB.set.saveitem: RequestError error. retrying...", str(e)
    return item  #the caller does item = saveitem(item), so hand the saved item back

#deals with most of what could go wrong when adding an attachment
def addattachment(item, content, name = "theattachment"):
    key = item["_id"]
    error = True
    while error:
        try:
            item.put_attachment(content = content, name = name) #, content_type = "application/octet-stream"
            error = False
        except ResourceConflict:
            try:
                item = DBEntry.get_or_create(key)
            except ResourceConflict:
                print "addattachment ResourceConflict, retrying..."
            except NoMoreData:
                print "addattachment NoMoreData, retrying..."

        except (NoMoreData) as e:
            print key, ": no more data exception, waiting 1 sec and retrying... -> ", str(e)
            time.sleep(1)
            item = DBEntry.get_or_create(key)
        except (IOError) as e:
            print "addattachment IOError:", str(e), "repeating..." 
            item = DBEntry.get_or_create(key)
        except (KeyError) as e:
            print "addattachment error:", str(e), "repeating..." 
            try:
                item = DBEntry.get_or_create(key)
            except ResourceConflict:
                pass
            except (NoMoreData) as e:
                pass

Then I save as follows:

        pickled = cPickle.dumps(obj = value, protocol = 2)
        pickled = pickled.decode('latin1')
        item = DBEntry(content={"seeattachment": True, "ispickled" : True},
            creationtm=datetime.datetime.utcnow(),lastaccesstm=datetime.datetime.utcnow())
        item = saveitem(item)
        addattachment(item, pickled)

And here is how I unpack. The data was written under Ubuntu 12 and fails to unpack under Ubuntu 14:

def unpackValue(self, value, therawkey):
    if value is None: return None
    originalval = value
    value = value["content"]
    result = None
    if value.has_key("realcontent"):
        result = value["realcontent"]
    elif value.has_key("seeattachment"):
        if originalval.has_key("_attachments"):
            if originalval["_attachments"].has_key("theattachment"):
                if originalval["_attachments"]["theattachment"].has_key("data"):
                    result = originalval["_attachments"]["theattachment"]["data"]
                    result = base64.b64decode(result)
                else:
                    print "unpackvalue: no data in attachment. Here is what it looks like:"
                    print originalval["_attachments"]["theattachment"].items()  #items() prints the pairs; iteritems() would only print an iterator object
        else:
            error = True
            while error:
                try:
                    result = self.db.fetch_attachment(therawkey, "theattachment")
                    error = False
                except ResourceConflict:
                    print "could not get attachment for", therawkey, "retrying..."
                    time.sleep(1)
                except ResourceNotFound:
                    self.delete(key = therawkey, rawkey = True)
                    return None

        if value["ispickled"]:
            result = cPickle.loads(result.encode('latin1'))
    else:
        result = value

    if isinstance(result, unicode): result = result.encode("utf8")
    return result

The line result = cPickle.loads(result.encode('latin1')) succeeds under Ubuntu 12 but fails under Ubuntu 14 with the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

I did NOT get that error under ubuntu 12!

How can I read my data under Ubuntu 14 while keeping the newer couchdbkit and Python versions? Is this even a versioning problem? Why does the error occur?

Upvotes: 0

Views: 512

Answers (1)

unutbu

Reputation: 879691

It appears that there is some change -- possibly in couchdbkit's API -- which makes result a UTF-8 encoded str whereas before it was unicode.
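This would also explain why the message names the 'ascii' codec even though the code asks for latin1: in Python 2, calling `.encode()` on a byte `str` first decodes it implicitly with the default ASCII codec, and that hidden step fails on the first non-ASCII byte. The sketch below spells that step out so it runs on Python 2 and 3 alike. It assumes the stored text comes back UTF-8 encoded, and the pickled list is made up for illustration.

```python
import pickle

pickled = pickle.dumps([1, 2, 3], protocol=2)  # protocol 2 starts with byte 0x80
as_text = pickled.decode('latin1')             # the text that was stored
as_utf8 = as_text.encode('utf8')               # assumed form of what comes back now

# U+0080 becomes the two bytes 0xc2 0x80 in UTF-8, hence "byte 0xc2 in position 0".
assert bytearray(as_utf8)[0] == 0xc2

# Python 2's str.encode('latin1') behaves like decode('ascii').encode('latin1').
try:
    as_utf8.decode('ascii').encode('latin1')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

assert "'ascii' codec can't decode byte 0xc2 in position 0" in message
```

The leading 0xc2 in the reported error is consistent with a protocol 2 pickle whose first character, U+0080, has been UTF-8 encoded somewhere along the way.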

Since you want the unicode encoded as latin1, the workaround is to decode the UTF-8 str back to unicode first:

cPickle.loads(result.decode('utf8').encode('latin1'))

Note that it would be better to find where result is getting UTF-8 encoded and either prevent that from happening (so you still have unicode, as you did under Ubuntu 12) or change the encoding to latin1, so that result is already in the form you want.
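Until the source of the re-encoding is found, a small wrapper can accept both forms. This is only a sketch: `to_pickle_bytes` is a hypothetical helper, not part of couchdbkit, and it assumes that any byte string it receives is UTF-8 encoded text, as diagnosed above. It uses `pickle` so that it runs on Python 2 and 3.

```python
import pickle


def to_pickle_bytes(result):
    """Return the original pickle bytes from either form of `result`.

    Hypothetical helper: byte input is assumed to be UTF-8 encoded text.
    """
    if isinstance(result, bytes):
        # Newer behaviour: UTF-8 encoded bytes, so recover the text first.
        result = result.decode('utf8')
    # Text (unicode) form: latin1 restores the raw bytes one to one.
    return result.encode('latin1')


# Both the old (text) and new (UTF-8 bytes) forms unpickle to the same value.
original = pickle.dumps({'answer': 42}, protocol=2)
stored_text = original.decode('latin1')
from_text = pickle.loads(to_pickle_bytes(stored_text))
from_utf8 = pickle.loads(to_pickle_bytes(stored_text.encode('utf8')))
assert from_text == from_utf8 == {'answer': 42}
```

If the attachment ever arrives as raw latin1 bytes instead, the `decode('utf8')` step may raise or silently produce different bytes, so the assumption is worth verifying against real data.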

Upvotes: 1
