Reputation: 3146
I have a pipeline that contains only a feature union combining three different sets of features, including tf-idf:
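from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
# AVectorizer and BVectorizer are my own custom transformers; stopWords is my stop-word list
A_vec = AVectorizer()
B_vec = BVectorizer()
tfidf_vec = TfidfVectorizer(ngram_range=(1,2), analyzer='word', binary=False, stop_words=stopWords, min_df=0.01, use_idf=True)
all_features = FeatureUnion([('A_feature', A_vec), ('V_feature', B_vec), ('tfidf_feature', tfidf_vec)])
pipeline = Pipeline([('all_feature', all_features)])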
I want to save this pipelined feature transformer for my test data (I am using LibSVM for classification), and this is what I have tried:
1. I have used joblib.dump to save this pipeline but it generated toooo many .npy files so I had to stop the writing process. It was a rather stupid attempt!
2. I have saved tfidf_vec.vocabulary_ and then built a new vectorizer from it:
tfidf_vec2 = TfidfVectorizer(ngram_range=(1,3), analyzer='word', binary=False, stop_words=stopWords, min_df=0.01, use_idf=True, vocabulary=pickle.load(open("../vocab.pkl", "rb")))
... ...
feat_test = pipeline2.transform(X_test)
It says "NotFittedError: idf vector is not fitted". I then used fit_transform rather than transform, but that generates a feature vector whose values differ from the correct feature vector. Then I followed http://thiagomarzagao.com/2015/12/08/saving-TfidfVectorizer-without-pickles/ and I am still struggling to get it to work.
Is there a simpler way to achieve this? Thanks!
Upvotes: 5
Views: 4039
Reputation: 784
It is not clear what you want to achieve and what issues you are facing. From what I understand, you tried this:
I have used joblib.dump to save this pipeline but it generated toooo many .npy files so I had to stop the writing process. It was a rather stupid attempt!
And since that did not satisfy you, you tried some other alternatives. Well, if you want to generate only one file, you can just do this:
joblib.dump(pipeline, 'filename.pkl', compress=1)
Also, I strongly recommend that you include a minimal working example next time!
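For completeness, here is a rough sketch of the whole round trip: fit, dump to one compressed file, reload, and call transform (not fit_transform) on the test data. The toy sentences and the CountVectorizer standing in for the custom AVectorizer/BVectorizer are my own assumptions, since those classes are not shown in the question; the file name is arbitrary.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

# Toy corpora standing in for the real training and test data (assumption).
X_train = ["the cat sat on the mat", "the dog ate my homework", "cats and dogs"]
X_test = ["the cat ate the dog", "my homework sat on the mat"]

# CountVectorizer is a hypothetical stand-in for the custom vectorizers.
all_features = FeatureUnion([
    ("count_feature", CountVectorizer(analyzer="char", ngram_range=(2, 3))),
    ("tfidf_feature", TfidfVectorizer(ngram_range=(1, 2), analyzer="word")),
])
pipeline = Pipeline([("all_feature", all_features)])

# Fit on the training data so the vocabulary and the idf weights are learned.
pipeline.fit(X_train)
expected = pipeline.transform(X_test)

# With compress set, joblib writes everything into a single file.
path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
written = joblib.dump(pipeline, path, compress=1)

# Later, e.g. in the test script: reload and only transform.
restored = joblib.load(path)
feat_test = restored.transform(X_test)

# The reloaded pipeline should reproduce the original features.
same = np.allclose(expected.toarray(), feat_test.toarray())
print(len(written), same)
```

Because the whole fitted object is serialized, the idf weights travel with the vocabulary, which should avoid the "idf vector is not fitted" error you get when only vocabulary_ is restored. Note that custom transformer classes must be importable in the script that calls joblib.load.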
Upvotes: 1
Reputation: 597
I would use joblib.dump as you have it in the first option. How many *.npy files is it generating? What is wrong with having lots of *.npy files?
Upvotes: 0