Reputation: 710
I am using TfidfVectorizer in scikit-learn to create a matrix from text data. Now I need to save this object to reuse it later. I tried to use pickle, but it gave the following error:
*** TypeError: can't pickle instancemethod objects
I tried using joblib in sklearn.externals, which again gave a similar error. Is there any way to save this object so that I can reuse it later?
Here is my full object:
class changeToMatrix(object):
    def __init__(self,ngram_range=(1,1),tokenizer=StemTokenizer()):
        from sklearn.feature_extraction.text import TfidfVectorizer
        self.vectorizer = TfidfVectorizer(ngram_range=ngram_range,analyzer='word',lowercase=True,

    def load_ref_text(self,text_file):
        textfile = open(text_file,'r')
        lines = textfile.readlines()
        sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        sentences = [item.strip().strip('.') for item in sent_tokenizer.tokenize(' '.join(lines).strip())]
        #vectorizer is transformed in this step
        chk2 = pd.DataFrame(self.vectorizer.fit_transform(sentences1).toarray())
        return sentences, [chk2]

    def get_processed_data(self,data_loc):
        loc = open("indexedData/vectorizer.obj","w")
        pickle.dump(self.vectorizer,loc) #getting error here
        return ref_sentences, ref_dataframes
Upvotes: 28
Views: 43741
Reputation: 23051
If you arrived at this Q/A to look into pickling a vectorizer to save space on disk, you can either use joblib (which comes with scikit-learn) with compress=True, or use the built-in gzip module along with pickle. A working example would look like the following. It compresses the file to be at least 2 times smaller for my use cases.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
import joblib
import pickle
import gzip
data = fetch_20newsgroups().data
tvec = TfidfVectorizer()
# option #1: joblib with built-in compression
joblib.dump(tvec, 'vectorizer.pkl', compress=True)
# option #2: pickle written through a gzip file handle
with gzip.open('vectorizer.pkl', 'wb') as f:
    pickle.dump(tvec, f)
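For completeness, here is a minimal sketch of the round trip, i.e. loading the object back with either option. The toy corpus, the temporary directory and the two file names are my own assumptions for illustration, and the vectorizer is fitted first so there is actually some state to restore.

```python
import gzip
import os
import pickle
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption): just enough text to fit a vocabulary.
docs = ["the cat sat on the mat", "the dog chased the cat"]
tvec = TfidfVectorizer().fit(docs)

# Hypothetical file names inside a throwaway directory.
tmp_dir = tempfile.mkdtemp()
joblib_path = os.path.join(tmp_dir, "vectorizer.joblib")
gzip_path = os.path.join(tmp_dir, "vectorizer.pkl.gz")

# Option 1: joblib handles the compression on dump and detects it on load.
joblib.dump(tvec, joblib_path, compress=True)
from_joblib = joblib.load(joblib_path)

# Option 2: pickle through gzip; read back with gzip.open in 'rb' mode.
with gzip.open(gzip_path, "wb") as f:
    pickle.dump(tvec, f)
with gzip.open(gzip_path, "rb") as f:
    from_gzip = pickle.load(f)

# Both restored vectorizers should transform text exactly like the original.
expected = tvec.transform(docs).toarray()
same_joblib = (from_joblib.transform(docs).toarray() == expected).all()
same_gzip = (from_gzip.transform(docs).toarray() == expected).all()
```

Note that the file must be read back the same way it was written: a gzip-compressed pickle needs gzip.open, not a plain open.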
Upvotes: 5
Reputation: 122012
Firstly, it's better to leave the import at the top of your code instead of within your class:
from sklearn.feature_extraction.text import TfidfVectorizer
class changeToMatrix(object):
    def __init__(self,ngram_range=(1,1),tokenizer=StemTokenizer()):
Next, StemTokenizer doesn't seem to be a canonical class. You may have got it from somewhere else, so we'll assume it takes a string and returns a list of strings.
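from nltk import word_tokenize
from nltk.corpus import wordnet as wn

class StemTokenizer(object):
    def __init__(self):
        self.ignore_set = {'footnote', 'nietzsche', 'plato', 'mr.'}

    def __call__(self, doc):
        words = []
        for word in word_tokenize(doc):
            word = word.lower()
            # wn.morphy returns the base form, or None if WordNet has no match
            w = wn.morphy(word)
            if w and len(w) > 1 and w not in self.ignore_set:
                words.append(w)
        return words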
Now to answer your actual question, it's possible that you need to open a file in byte mode before dumping a pickle, i.e.:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from nltk import word_tokenize
>>> import cPickle as pickle
>>> vectorizer = TfidfVectorizer(ngram_range=(0,2),analyzer='word',lowercase=True, token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',tokenizer=word_tokenize)
>>> vectorizer
TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(0, 2), norm=u'l2', preprocessor=None, smooth_idf=True,
stop_words=None, strip_accents='unicode', sublinear_tf=False,
tokenizer=<function word_tokenize at 0x7f5ea68e88c0>, use_idf=True,
>>> with open('', 'wb') as fin:
...     pickle.dump(vectorizer, fin)
>>> exit()
alvas@ubi:~$ ls -lah
-rw-rw-r-- 1 alvas alvas 763 Jun 15 14:18
Note: Using the with idiom for file I/O automatically closes the file once you leave the with block.
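To illustrate the byte-mode point, here is a small sketch using only the standard library. It assumes Python 3, where pickle writes bytes and a file opened in text mode rejects them; the payload dict and the file name are made up for the example and simply stand in for the vectorizer.

```python
import os
import pickle
import tempfile

# Stand-in object (assumption): any picklable value behaves the same way here.
payload = {"ngram_range": (1, 2), "lowercase": True}
path = os.path.join(tempfile.mkdtemp(), "payload.pkl")

# Text mode ('w'): on Python 3 the file only accepts str, so dumping fails.
try:
    with open(path, "w") as fout:
        pickle.dump(payload, fout)
    text_mode_error = None
except TypeError as err:
    text_mode_error = err

# Byte mode ('wb' to write, 'rb' to read) works, and `with` closes the file.
with open(path, "wb") as fout:
    pickle.dump(payload, fout)
with open(path, "rb") as fin:
    restored = pickle.load(fin)
```

On Python 2 the text-mode write may appear to work on some platforms, which is one reason to always use 'wb' and 'rb' for pickles.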
Regarding the issue with SnowballStemmer(), note that SnowballStemmer('english') is an object, while the stemming function is SnowballStemmer('english').stem. TfidfVectorizer's tokenizer parameter expects a function that takes a string and returns a list of strings. So you will need to do this:
>>> from nltk.stem import SnowballStemmer
>>> from nltk import word_tokenize
>>> stemmer = SnowballStemmer('english').stem
>>> def stem_tokenize(text):
...     return [stemmer(i) for i in word_tokenize(text)]
>>> vectorizer = TfidfVectorizer(ngram_range=(0,2),analyzer='word',lowercase=True, token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',tokenizer=stem_tokenize)
>>> with open('', 'wb') as fin:
...     pickle.dump(vectorizer, fin)
>>> exit()
alvas@ubi:~$ ls -lah
-rw-rw-r-- 1 alvas alvas 758 Jun 15 15:55
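One caveat worth knowing when you pickle a vectorizer with a custom tokenizer: pickle stores a function by reference (its module and name), not by value. The sketch below shows this with the standard library only; shlex.split is just a stand-in for a tokenizer function (it takes a string and returns a list of strings) and is my choice for illustration, not something from your code.

```python
import pickle
import shlex

# A named, importable function is pickled as a reference to "shlex" + "split".
blob = pickle.dumps(shlex.split)
restored = pickle.loads(blob)

# The pickle holds the names, and loading returns the very same function.
stores_module_name = b"shlex" in blob
same_function = restored is shlex.split

# A lambda has no importable name, so pickle cannot store a reference to it.
try:
    pickle.dumps(lambda text: text.split())
    lambda_error = None
except (pickle.PicklingError, AttributeError) as err:
    lambda_error = err
```

In practice this means the process that loads the pickled vectorizer must be able to import or define stem_tokenize under the same module and name, otherwise unpickling fails even though the dump succeeded.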
Upvotes: 17