Reputation: 3475
I know this has been covered by a number of other questions (Unable to load files using pickle and multiple modules) but I can't see how their solutions apply to my situation.
This is my project structure (as minimal as possible):
classify-updater/
├── main.py
└── updater
    ├── __init__.py
    └── updater.py
classify
└── main.py
In classify-updater/main.py:
import sys
from sklearn.feature_extraction.text import CountVectorizer
from updater.updater import Updater

def main(argv):
    vectorizer = CountVectorizer(stop_words='english')
    updater = Updater(vectorizer)
    updater.update()

if __name__ == "__main__":
    main(sys.argv)
In classify-updater/updater/updater.py:
import dill

class Updater:
    def __init__(self, vectorizer):
        # This lambda is defined inside the updater.updater module
        vectorizer.preprocessor = lambda doc: doc.text.encode('ascii', 'ignore')
        self.vectorizer = vectorizer

    def update(self):
        pickled_vectorizer = dill.dumps(self.vectorizer)
        # Save to Google Cloud Storage
In classify/main.py:
import dill
import sys

def main(argv):
    # Load from Google Cloud Storage
    vectorizer = dill.loads(vectorizer_blob)

if __name__ == "__main__":
    main(sys.argv)
This results in an ImportError.
Traceback (most recent call last):
  File "classify.py", line 102, in <module>
    app.main(sys.argv)
  File "classify.py", line 50, in main
    vectorizer = self.fetch_vectorizer()
  File "classify.py", line 86, in fetch_vectorizer
    vectorizer = dill.loads(vectorizer_blob.download_as_string())
  File "/usr/local/lib/python2.7/site-packages/dill/dill.py", line 299, in loads
    return load(file)
  File "/usr/local/lib/python2.7/site-packages/dill/dill.py", line 288, in load
    obj = pik.load()
  File "/usr/local/Cellar/python/2.7.13_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/local/Cellar/python/2.7.13_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1096, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/lib/python2.7/site-packages/dill/dill.py", line 445, in find_class
    return StockUnpickler.find_class(self, module, name)
  File "/usr/local/Cellar/python/2.7.13_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1130, in find_class
    __import__(module)
ImportError: No module named updater.updater
It has been explained elsewhere that pickle needs the class definition to load the object, but I can't see where the reference to the updater module comes from as I'm only pickling an instance of the Vectorizer.
I've simplified this example heavily. The two packages sit quite far apart in terms of our codebase. Importing one module into the other might not be feasible. Is there any way to work around this?
Upvotes: 1
Views: 1850
Reputation: 3475
The issue here is the lambda (anonymous function).
It is completely possible to pickle a self-contained object like the Vectorizer. However, the preprocessing lambda in the example is defined in updater/updater.py, so the pickled vectorizer carries a reference to the updater.updater module, and that module must be importable wherever the vectorizer is unpickled.
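To illustrate where such a module reference comes from, here is a minimal sketch using only the standard library. It assumes Python 3 and uses a hypothetical in-memory module named updater_demo as a stand-in for updater.updater. dill serializes more than the standard pickle module does, but the traceback in the question ends in the same import-by-name step (find_class calling __import__), which is the step this sketch reproduces.

```python
import pickle
import sys
import types

# Build a throwaway in-memory module. "updater_demo" is a hypothetical name
# standing in for updater.updater from the question.
demo = types.ModuleType("updater_demo")
exec(
    "def strip_non_ascii(text):\n"
    "    return text.encode('ascii', 'ignore').decode('ascii')\n",
    demo.__dict__,
)
sys.modules["updater_demo"] = demo

# pickle stores a function as a reference: its module name plus its own name.
payload = pickle.dumps(demo.strip_non_ascii)
references_module = b"updater_demo" in payload

# Simulate the second codebase, where that module cannot be imported.
del sys.modules["updater_demo"]
try:
    pickle.loads(payload)
    load_error = None
except ImportError as exc:
    load_error = exc

print("payload names the module:", references_module)
print("error on load:", type(load_error).__name__)
```

Any object that holds such a function as an attribute inherits the same dependency, which is how a vectorizer with a custom preprocessor ends up needing the module that defined it.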
Rather than having a preprocessor function, preprocess the data yourself and pass that in to fit the vectorizer. That will remove the need for the Updater class when unpickling.
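A minimal sketch of that approach, assuming Python 3 and scikit-learn are available. The helper strip_non_ascii and the sample documents are hypothetical, and the documents are plain strings here rather than objects with a .text attribute as in the question.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer


def strip_non_ascii(text):
    """Drop non-ASCII characters; plays the role of the question's lambda."""
    return text.encode("ascii", "ignore").decode("ascii")


# Hypothetical sample documents; real ones would come from the caller's data.
raw_docs = ["The café is open", "Cafés serve coffee and tea"]

# Clean the text up front instead of attaching a preprocessor to the vectorizer.
clean_docs = [strip_non_ascii(doc) for doc in raw_docs]

vectorizer = CountVectorizer(stop_words="english")
vectorizer.fit(clean_docs)

# The fitted vectorizer holds no custom function, so the standard pickle
# module is sufficient and the payload does not mention any updater module.
payload = pickle.dumps(vectorizer)
restored = pickle.loads(payload)

# The consuming side applies the same cleaning before calling transform.
counts = restored.transform([strip_non_ascii("coffee at the café")])
print(counts.sum())
```

The trade-off is that the consuming code must apply the same cleaning step to its own input before calling transform. Also note that without a custom preprocessor, CountVectorizer applies its default lowercasing, which the lambda in the question replaced.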
Upvotes: 2