Reputation: 122142
I can save a serialized corpus into foobar.mm, but when I try to load it, it raises an UnpicklingError. Loading the dictionary seems fine, though. Does anyone know how to resolve this, and why it occurs?
>>> from gensim import corpora
>>> docs = ["this is a foo bar", "you are a foo"]
>>> texts = [[i for i in doc.lower().split()] for doc in docs]
>>> print texts
[['this', 'is', 'a', 'foo', 'bar'], ['you', 'are', 'a', 'foo']]
>>> dictionary = corpora.Dictionary(texts)
>>> dictionary.save('foobar.dic')
>>> print dictionary
Dictionary(7 unique tokens)
>>> corpora.Dictionary.load('foobar.dic')
<gensim.corpora.dictionary.Dictionary object at 0x329f910>
>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> corpora.MmCorpus.serialize('foobar.mm', corpus)
>>> corpus = corpora.MmCorpus.load('foobar.mm')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/gensim-0.8.6-py2.7.egg/gensim/utils.py", line 166, in load
return unpickle(fname)
File "/usr/local/lib/python2.7/dist-packages/gensim-0.8.6-py2.7.egg/gensim/utils.py", line 492, in unpickle
return cPickle.load(open(fname, 'rb'))
cPickle.UnpicklingError: invalid load key, '%'.
Upvotes: 2
Views: 3153
Reputation: 4266
See the documentation at http://radimrehurek.com/gensim/tut1.html#corpus-formats
What you're trying to do is store the corpus in MatrixMarket format (a plain-text format) and then load it using the binary save/load interface, which expects a pickle file.
To load a serialized MatrixMarket corpus, construct the corpus directly from the file: corpus = corpora.MmCorpus('foobar.mm')
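The mismatch described above can be reproduced without gensim. The following is a minimal Python 3 sketch, assuming a hand-written MatrixMarket file as a stand-in for the real foobar.mm (the file contents and the helper names try_unpickle and read_mm_header are illustrative, not part of gensim). It shows that a text file beginning with '%' is rejected by pickle, while reading it as text works.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for 'foobar.mm': a tiny hand-written MatrixMarket
# coordinate file (2 documents, 7 terms, 3 non-zero entries).
MM_TEXT = (
    "%%MatrixMarket matrix coordinate real general\n"
    "2 7 3\n"
    "1 1 1\n"
    "1 2 1\n"
    "2 1 1\n"
)


def try_unpickle(path):
    """Return the unpickled object, or the error message if unpickling fails."""
    with open(path, "rb") as handle:
        try:
            return pickle.load(handle)
        except pickle.UnpicklingError as exc:
            return str(exc)


def read_mm_header(path):
    """Read the first line of the file as plain text."""
    with open(path) as handle:
        return handle.readline().strip()


tmp_dir = tempfile.mkdtemp()
mm_path = os.path.join(tmp_dir, "foobar.mm")
with open(mm_path, "w") as handle:
    handle.write(MM_TEXT)

unpickle_result = try_unpickle(mm_path)  # pickle chokes on the leading '%'
header = read_mm_header(mm_path)         # the same file is readable as text

print(unpickle_result)
print(header)
```

The '%' in the error message is simply the first byte of the MatrixMarket header line, which is why the error appears only for the .mm file and not for the dictionary saved with save().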
Upvotes: 4
Reputation: 59516
Since gensim's corpora (whatever this is) is using pickle, as the stack trace reveals, you will only be able to store data of a limited type. For more details see "What can be pickled and unpickled?" in the Python docs.
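As a rough illustration of that distinction, here is a small Python 3 sketch using only the standard library. The bag-of-words data and the helper is_picklable are made up for the example; the assumption is that the corpus is a plain list of (token_id, count) tuples, like the one built in the question.

```python
import pickle

# Assumed shape of a bag-of-words corpus: lists of (token_id, count) tuples.
bow_corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 1)]]

# Plain lists, tuples, and ints survive a pickle round trip unchanged.
restored = pickle.loads(pickle.dumps(bow_corpus))


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


# A generator streaming the same documents is one example of an
# object that pickle refuses to serialize.
streamed = (doc for doc in bow_corpus)

print(restored == bow_corpus)
print(is_picklable(bow_corpus))
print(is_picklable(streamed))
```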
If this does not apply (i.e. if what you want to pickle and unpickle should be picklable), I fear you might have found a bug in the pickle module. In that case, upgrading to a newer Python version might solve your issue.
Upvotes: -1