user2682863

Reputation: 3218

Porting pickle py2 to py3 strings become bytes

I have a pickle file that was created with python 2.7 that I'm trying to port to python 3.6. The file is saved in py 2.7 via pickle.dumps(self.saved_objects, -1)

and loaded in Python 3.6 via loads(data, encoding="bytes") (from a file opened in rb mode). If I open the file in r mode and pass encoding="latin1" to loads, I get UnicodeDecodeError. When I open it as a byte stream it loads, but every string is now a byte string. Every object's __dict__ keys are b"a_variable_name", so an_object.a_variable_name raises AttributeError: attribute lookup uses a str name and __dict__ only contains bytes keys. I feel like I've tried every combination of arguments and pickle protocols already. Apart from forcibly converting every object's __dict__ keys to strings, I'm at a loss. Any ideas?

** Skip to 4/28/17 update for better example

-------------------------------------------------------------------------------------------------------------

** Update 4/27/17

This minimum example illustrates my problem:

From py 2.7.13

import pickle

class test(object):
    def __init__(self):
        self.x = u"test ¢" # including a unicode str breaks things

t = test()
dumpstr = pickle.dumps(t)

>>> dumpstr
"ccopy_reg\n_reconstructor\np0\n(c__main__\ntest\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\n(dp5\nS'x'\np6\nVtest \xa2\np7\nsb."

From py 3.6.1

import pickle

class test(object):
    def __init__(self):
        self.x = "xyz"

dumpstr = b"ccopy_reg\n_reconstructor\np0\n(c__main__\ntest\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\n(dp5\nS'x'\np6\nVtest \xa2\np7\nsb."

t = pickle.loads(dumpstr, encoding="bytes")

>>> t
<__main__.test object at 0x040E3DF0>
>>> t.x
Traceback (most recent call last):
  File "<pyshell#15>", line 1, in <module>
    t.x
AttributeError: 'test' object has no attribute 'x'
>>> t.__dict__
{b'x': 'test ¢'} 
>>> 

-------------------------------------------------------------------------------------------------------------

Update 4/28/17

To re-create my issue I'm posting my actual raw pickle data here

The pickle file was created in python 2.7.13, windows 10 using

with open("raw_data.pkl", "wb") as fileobj:
    pickle.dump(library, fileobj, protocol=0)

(protocol 0 so it's human readable)

To run it you'll need classes.py

# classes.py

class Library(object): pass


class Book(object): pass


class Student(object): pass


class RentalDetails(object): pass

And the test script here:

# load_pickle.py
import pickle, sys, itertools, os

raw_pkl = "raw_data.pkl"
is_py3 = sys.version_info.major == 3

read_modes = ["rb"]
encodings = ["bytes", "utf-8", "latin-1"]
fix_imports_choices = [True, False]
files = ["raw_data_%s.pkl" % x for x in range(3)]


def py2_test():
    with open(raw_pkl, "rb") as fileobj:
        loaded_object = pickle.load(fileobj)
        print("library dict: %s" % (loaded_object.__dict__.keys()))
        return loaded_object


def py2_dumps():
    library = py2_test()
    for protocol, path in enumerate(files):
        print("dumping library to %s, protocol=%s" % (path, protocol))
        with open(path, "wb") as writeobj:
            pickle.dump(library, writeobj, protocol=protocol)


def py3_test():
    # this test iterates over the different options trying to load
    # the data pickled with py2 into a py3 environment
    print("starting py3 test")
    for (read_mode, encoding, fix_import, path) in itertools.product(read_modes, encodings, fix_imports_choices, files):
        py3_load(path, read_mode=read_mode, fix_imports=fix_import, encoding=encoding)


def py3_load(path, read_mode, fix_imports, encoding):
    from traceback import print_exc
    print("-" * 50)
    print("path=%s, read_mode = %s fix_imports = %s, encoding = %s" % (path, read_mode, fix_imports, encoding))
    if not os.path.exists(path):
        print("start this file with py2 first")
        return
    try:
        with open(path, read_mode) as fileobj:
            loaded_object = pickle.load(fileobj, fix_imports=fix_imports, encoding=encoding)
            # print the object's __dict__
            print("library dict: %s" % (loaded_object.__dict__.keys()))
            # consider the test a failure if any member attributes are saved as bytes
            test_passed = not any((isinstance(k, bytes) for k in loaded_object.__dict__.keys()))
            print("Test %s" % ("Passed!" if test_passed else "Failed"))
    except Exception:
        print_exc()
        print("Test Failed")
    input("Press Enter to continue...")
    print("-" * 50)


if is_py3:
    py3_test()
else:
    # py2_test()
    py2_dumps()

Put all 3 files in the same directory and run c:\python27\python load_pickle.py first, which creates one pickle file for each of the 3 protocols. Then run the same script with Python 3 and notice that the Python 3 version converts the __dict__ keys to bytes. I had it working for about 6 hours, but for the life of me I can't figure out how I broke it again.

Upvotes: 12

Views: 2599

Answers (3)

stovfl

Reputation: 15513

Question: Porting pickle py2 to py3 strings become bytes

The encoding='latin-1' shown below is fine.
The b'' keys are the result of using encoding='bytes', which unpickles Python 2 str objects, including dict keys, as bytes instead of str.
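To make the difference concrete, here is a minimal sketch for Python 3. The byte string is hand-written and assumed to match what Python 2 produces for pickle.dumps({'x': u'test \xa2'}) with protocol 0; the names py2_pickle, as_bytes and as_latin1 are only for illustration.

```python
import pickle

# Hand-written protocol-0 pickle, assumed equivalent to Python 2's
# pickle.dumps({'x': u'test \xa2'}): the key is a py2 str (S opcode),
# the value is a py2 unicode (V opcode).
py2_pickle = b"(dp0\nS'x'\np1\nVtest \xa2\np2\ns."

# encoding="bytes" leaves py2 str objects as bytes, so the key is b'x'.
as_bytes = pickle.loads(py2_pickle, encoding="bytes")

# encoding="latin-1" decodes py2 str objects to str, so the key is 'x'.
as_latin1 = pickle.loads(py2_pickle, encoding="latin-1")

print(as_bytes)
print(as_latin1)
```

In both cases the unicode value is decoded to str; only the py2 str key is affected by the encoding argument.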

The problem data are the datetime.date values '\x07á\x02\x10', starting at line 56 in raw_data.pkl.

It's a known issue, as already pointed out:
Unpickling python2 datetime under python3
http://bugs.python.org/issue22005

As a workaround, I patched pickle.py and got an unpickled object, e.g.

book.library.books[0].rentals[0].rental_date=2017-02-16


This works for me:

t = pickle.loads(dumpstr, encoding="latin-1")

Output:
<__main__.test object at 0xf7095fec>
t.__dict__={'x': 'test ¢'}
test ¢

Tested with Python 3.4.2

Upvotes: 1

Roland Smith

Reputation: 43495

You should treat pickle data as specific to the (major) version of Python that created it.

(See Gregory Smith's message w.r.t. issue 22005.)

The best way to get around this is to write a Python 2.7 program to read the pickled data, and write it out in a neutral format.

Taking a quick look at your actual data, it seems to me that an SQLite database is appropriate as an interchange format, since the Books contain references to a Library and RentalDetails. You could create separate tables for each.
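A rough sketch of such an export, using only the standard sqlite3 module so it could run under Python 2.7 after unpickling. The table layout, the export_books function and the title attribute are assumptions made for illustration; rentals and rental_date follow the attribute names seen in the posted data. The stand-in Book and RentalDetails classes exist only so the sketch runs on its own.

```python
import datetime
import sqlite3


class Book(object):
    pass


class RentalDetails(object):
    pass


def export_books(conn, books):
    """Write books and their rentals into two tables (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rentals ("
        "book_id INTEGER REFERENCES books(id), rental_date TEXT)")
    for book in books:
        cur = conn.execute("INSERT INTO books (title) VALUES (?)", (book.title,))
        for rental in book.rentals:
            # Dates are stored as ISO strings, which any reader can parse.
            conn.execute(
                "INSERT INTO rentals (book_id, rental_date) VALUES (?, ?)",
                (cur.lastrowid, rental.rental_date.isoformat()))
    conn.commit()


# Stand-in data; in practice this would come from pickle.load under Python 2.7.
rental = RentalDetails()
rental.rental_date = datetime.date(2017, 2, 16)
book = Book()
book.title = "Example title"
book.rentals = [rental]

conn = sqlite3.connect(":memory:")
export_books(conn, [book])
rows = conn.execute(
    "SELECT b.title, r.rental_date FROM books b "
    "JOIN rentals r ON r.book_id = b.id").fetchall()
print(rows)
```

The Python 3 side then reads plain rows and never touches the old pickle.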

Upvotes: 1

gz.

Reputation: 6711

In short, you're hitting bug 22005 with datetime.date objects in the RentalDetails objects.

That can be worked around with the encoding='bytes' parameter, but that leaves your classes with __dict__ containing bytes:

>>> library = pickle.loads(pickle_data, encoding='bytes')
>>> dir(library)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'str' and 'bytes'

It's possible to manually fix that based on your specific data:

def fix_object(obj):
    """Decode obj.__dict__ containing bytes keys"""
    obj.__dict__ = dict((k.decode("ascii"), v) for k, v in obj.__dict__.items())


def fix_library(library):
    """Walk all library objects and decode __dict__ keys"""
    fix_object(library)
    for student in library.students:
        fix_object(student)
    for book in library.books:
        fix_object(book)
        for rental in book.rentals:
            fix_object(rental)

But that's fragile and enough of a pain you should be looking for a better option.
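If you do go the manual route, a generic walker avoids hard-coding the object layout. The following is only a sketch: decode_dict_keys is a hypothetical helper, it assumes all attribute names are ASCII, and it only fixes __dict__ keys, so attribute values that were Python 2 str objects stay bytes. It tracks visited objects so the backreferences to library don't cause infinite recursion.

```python
def decode_dict_keys(obj, _seen=None):
    """Recursively decode bytes keys in instance __dict__s (hypothetical helper)."""
    if _seen is None:
        _seen = set()
    if id(obj) in _seen:
        return obj  # already visited; guards against reference cycles
    _seen.add(id(obj))

    if isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            decode_dict_keys(item, _seen)
    elif isinstance(obj, dict):
        for value in obj.values():
            decode_dict_keys(value, _seen)
    elif hasattr(obj, "__dict__") and not isinstance(obj, type):
        # Replace the instance dict with one whose bytes keys are decoded.
        obj.__dict__ = {
            (k.decode("ascii") if isinstance(k, bytes) else k): v
            for k, v in obj.__dict__.items()
        }
        for value in obj.__dict__.values():
            decode_dict_keys(value, _seen)
    return obj


# Small self-contained demonstration with a cycle, mimicking the bytes keys.
class Node(object):
    pass


lib = Node()
book = Node()
lib.__dict__ = {b"books": [book]}
book.__dict__ = {b"library": lib, b"title": "some title"}

decode_dict_keys(lib)
print(sorted(lib.__dict__), sorted(book.__dict__))
```

It would be called once on the object returned by pickle.loads(..., encoding='bytes').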

1) Implement __getstate__/__setstate__ that maps datetime objects to a non-broken representation, for instance:

import datetime


class Event(object):
    """Example class working around datetime pickling bug"""

    def __init__(self):
        self.date = datetime.date.today()

    def __getstate__(self):
        # Store the date as a plain integer instead of a datetime.date
        state = self.__dict__.copy()
        state["date"] = state["date"].toordinal()
        return state

    def __setstate__(self, state):
        # Restore the attributes, then rebuild the date from the integer
        self.__dict__.update(state)
        self.date = datetime.date.fromordinal(self.date)

2) Don't use pickle at all. Along the lines of __getstate__/__setstate__, you can implement to_dict/from_dict methods or similar in your classes and save their content as JSON or some other plain format.
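A minimal sketch of that second option follows. The RentalRecord class and its title field are made up for illustration, and storing the date as an ISO string is just one reasonable choice; the code sticks to calls that exist in both Python 2.7 and Python 3.

```python
import datetime
import json


class RentalRecord(object):
    """Hypothetical class showing a to_dict/from_dict round trip."""

    def __init__(self, title, rental_date):
        self.title = title
        self.rental_date = rental_date

    def to_dict(self):
        # Only plain JSON-friendly types leave the object.
        return {
            "title": self.title,
            "rental_date": self.rental_date.isoformat(),
        }

    @classmethod
    def from_dict(cls, data):
        # strptime is available in both Python 2.7 and Python 3.
        parsed = datetime.datetime.strptime(data["rental_date"], "%Y-%m-%d")
        return cls(data["title"], parsed.date())


original = RentalRecord("Example title", datetime.date(2017, 2, 16))
payload = json.dumps(original.to_dict())
restored = RentalRecord.from_dict(json.loads(payload))
print(payload)
```

Because the payload is plain text, it can be written by Python 2 and read by Python 3 without any encoding arguments.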

A final note, having a backreference to library in each object shouldn't be required.

Upvotes: 9
