Josh

Reputation: 3475

Pickling and Unpickling in different modules

I know this has been covered by a number of other questions (e.g. "Unable to load files using pickle and multipile modules"), but I can't see how their solutions apply to my situation.

This is my project structure (as minimal as possible):

classify-updater/
├── main.py
└── updater
    ├── __init__.py
    └── updater.py
classify
└── main.py

In classify-updater/main.py:

import sys
from sklearn.feature_extraction.text import CountVectorizer
from updater.updater import Updater

def main(argv):
    vectorizer = CountVectorizer(stop_words='english')
    updater = Updater(vectorizer)
    updater.update()

if __name__ == "__main__":
    main(sys.argv)

In classify-updater/updater/updater.py:

import dill

class Updater:

    def __init__(self, vectorizer):
        vectorizer.preprocessor = lambda doc: doc.text.encode('ascii', 'ignore')
        self.vectorizer = vectorizer

    def update(self):
        pickled_vectorizer = dill.dumps(self.vectorizer)
        # Save to Google Cloud Storage

In classify/main.py:

import dill
import sys

def main(argv):
    # Load from Google Cloud Storage
    vectorizer = dill.loads(vectorizer_blob)

if __name__ == "__main__":
    main(sys.argv)

Unpickling raises an ImportError:

Traceback (most recent call last):
  File "classify.py", line 102, in <module>
    app.main(sys.argv)
  File "classify.py", line 50, in main
    vectorizer = self.fetch_vectorizer()
  File "classify.py", line 86, in fetch_vectorizer
    vectorizer = dill.loads(vectorizer_blob.download_as_string())
  File "/usr/local/lib/python2.7/site-packages/dill/dill.py", line 299, in loads
    return load(file)
  File "/usr/local/lib/python2.7/site-packages/dill/dill.py", line 288, in load
    obj = pik.load()
  File "/usr/local/Cellar/python/2.7.13_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/local/Cellar/python/2.7.13_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1096, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/lib/python2.7/site-packages/dill/dill.py", line 445, in find_class
    return StockUnpickler.find_class(self, module, name)
  File "/usr/local/Cellar/python/2.7.13_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1130, in find_class
    __import__(module)
ImportError: No module named updater.updater

It has been explained elsewhere that pickle needs the class definition to load the object, but I can't see where the reference to the updater module comes from as I'm only pickling an instance of the Vectorizer.

I've simplified this example heavily. The two packages sit quite far apart in terms of our codebase. Importing one module into the other might not be feasible. Is there any way to work around this?

Upvotes: 1

Views: 1850

Answers (1)

Josh

Reputation: 3475

The issue here is the lambda (anonymous function).

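A self-contained object like the vectorizer pickles without trouble. However, the preprocessor lambda in the example is defined in updater/updater.py, so the pickled vectorizer carries a reference to the updater.updater module, and that module must be importable wherever the object is unpickled. That is why the traceback ends in an attempt to import updater.updater.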
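The underlying mechanism can be reproduced with the standard library alone. The sketch below is an illustration, not the original code: it uses plain pickle and a named function instead of dill and a lambda, and the module name fake_updater is made up. dill serialises lambdas differently, but the traceback in the question shows it still ends up importing the defining module.

```python
import pickle
import sys
import types


def pickle_function_from_temp_module(module_name):
    """Define a function in an in-memory module and pickle it."""
    module = types.ModuleType(module_name)
    # The function's __module__ becomes module_name because it is
    # created with the module's namespace as its globals.
    exec("def preprocess(text):\n    return text.lower()\n", module.__dict__)
    sys.modules[module_name] = module
    try:
        return pickle.dumps(module.preprocess)
    finally:
        # Simulate the consuming process, where the module does not exist.
        del sys.modules[module_name]


# "fake_updater" is a hypothetical stand-in for updater.updater.
payload = pickle_function_from_temp_module("fake_updater")

# The payload names the module and the function; it does not contain the code.
print(b"fake_updater" in payload, b"lower" in payload)

try:
    pickle.loads(payload)
except ImportError as exc:
    print("unpickling failed:", exc)
```

The load fails in the same way as in the question: the unpickler has only a module name and an attribute name to go on, so it has to import the module.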

Rather than attaching a preprocessor function to the vectorizer, preprocess the data yourself and fit the vectorizer on the cleaned text. The pickled vectorizer then no longer refers to the updater module, so nothing from it is needed when unpickling.
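A minimal sketch of that approach, assuming each document exposes a .text attribute as in the question. The helper names to_ascii and fit_on_clean_text are hypothetical, and the code targets Python 3, where the encoded bytes must be decoded back to str (the question runs Python 2.7).

```python
def to_ascii(text):
    """Drop non-ASCII characters, mirroring the lambda from the question."""
    return text.encode("ascii", "ignore").decode("ascii")


def fit_on_clean_text(vectorizer, docs):
    """Fit on pre-cleaned strings so the vectorizer holds no custom callable.

    `vectorizer` is assumed to expose fit(), as CountVectorizer does;
    `docs` is assumed to be an iterable of objects with a .text attribute.
    """
    cleaned = [to_ascii(doc.text) for doc in docs]
    vectorizer.fit(cleaned)
    return vectorizer
```

In Updater this would be called as fit_on_clean_text(CountVectorizer(stop_words='english'), docs) before dill.dumps. The trade-off is that classify must apply the same cleaning to its input before calling transform, since the vectorizer no longer does it.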

Upvotes: 2
