Reputation: 3146

How to save sklearn pipeline/feature-transformer

I have a pipeline that contains only a feature union combining three different sets of features, including tfidf:

A_vec = AVectorizer()
B_vec = BVectorizer()
tfidf_vec = TfidfVectorizer(ngram_range=(1,2), analyzer='word', binary=False, stop_words=stopWords, min_df=0.01, use_idf=True)
all_features = FeatureUnion([('A_feature', A_vec), ('V_feature', B_vec), ('tfidf_feature', tfidf_vec)])
pipeline = Pipeline([('all_feature', all_features)])

I want to save this pipelined feature transformer for my test data (I am using LibSVM for classification), and this is what I have tried:

I have used joblib.dump to save this pipeline but it generated toooo many .npy files so I had to stop the writing process. It was a rather stupid attempt!
I have saved tfidf_vec.vocabulary_ and then rebuilt the vectorizer from it:

tfidf_vec2 = TfidfVectorizer(ngram_range=(1,3), analyzer='word', binary=False, stop_words=stopWords, min_df=0.01, use_idf=True, vocabulary=pickle.load(open("../vocab.pkl", "rb")))

... ...

feat_test = pipeline2.transform(X_test)

It says "NotFittedError: idf vector is not fitted". I then used fit_transform rather than transform, but it generates a feature vector whose values differ from the correct one. Then I followed http://thiagomarzagao.com/2015/12/08/saving-TfidfVectorizer-without-pickles/ and am still struggling to get it to work.

Is there a simpler way to achieve this? Thanks!

Upvotes: 5

Answers (2)

Federico Dorato

Reputation: 784

It is not clear what you want to achieve or what issues you are facing. From what I understand, you tried this:

I have used joblib.dump to save this pipeline but it generated toooo many .npy files so I had to stop the writing process. It was a rather stupid attempt!

Since that was not satisfactory, you tried some other alternatives. If you want to generate only one file, you can just do this:

joblib.dump(pipeline, 'filename.pkl', compress = 1)
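To make the round trip concrete, here is a minimal sketch of fitting, saving, and reloading a feature-union pipeline. The toy corpus, the temporary file path, and the use of CountVectorizer as a stand-in for the custom AVectorizer/BVectorizer are all assumptions for illustration, not part of the original question. The key point is that the dumped object is the fitted pipeline, so the learned vocabulary and idf weights travel with it, and on the test side only transform is called.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

# Toy corpus standing in for the real training and test data (hypothetical).
train_docs = ["the cat sat on the mat", "the dog ate my homework", "cats and dogs"]
test_docs = ["the cat ate the dog"]

# CountVectorizer is only a stand-in for the custom AVectorizer/BVectorizer.
all_features = FeatureUnion([
    ("count_feature", CountVectorizer()),
    ("tfidf_feature", TfidfVectorizer(ngram_range=(1, 2), use_idf=True)),
])
pipeline = Pipeline([("all_feature", all_features)])

# Fit on the training data only; this learns the vocabulary and idf weights.
pipeline.fit(train_docs)
expected = pipeline.transform(test_docs)

# Dump the fitted pipeline with compression, as suggested above.
path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
written = joblib.dump(pipeline, path, compress=1)

# Later, e.g. in the test script: load and call transform, not fit_transform.
restored = joblib.load(path)
actual = restored.transform(test_docs)

# The restored pipeline should reproduce the original features.
same_features = np.allclose(expected.toarray(), actual.toarray())
print(written, same_features)
```

Note that with custom transformer classes, the script that calls joblib.load must be able to import those class definitions, because pickling stores a reference to the class rather than its code.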

Also, I strongly recommend that you include a minimal, reproducible example next time!

Upvotes: 1

user2161903

Reputation: 597

I would use joblib.dump as you have it in the first option. How many *.npy files is it generating? What is wrong with having lots of *.npy files?

Upvotes: 0
