Reputation: 3
I'm trying to store the TfIdf vectorizer/model (I don't know which is the right word) obtained after training on the dataset, and then load the stored model to fit a new dataset. The model is stored and loaded using pickle.
I have stored the TfIdf vocabulary obtained during the training phase. Then I load the stored vocabulary into a vectorizer to fit the test data:
def Savetfidf(df):
    vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(1,2))
    X = pd.SparseDataFrame(vectorizer.fit_transform(df), columns = vectorizer.get_feature_names(), default_fill_value = 0)
    pickle.dump(vectorizer.vocabulary_, open("features.pkl", "wb"))
    return X

def Loadtfidf(df):
    vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(1,2))
    vocabulary = pickle.load(open(feature, 'rb'))
    vectorizer.vocabulary_ = vocabulary
    X = pd.SparseDataFrame(vectorizer.transform(df), columns = vectorizer.get_feature_names(), default_fill_value = 0)
    return X
I'm getting an error:
"sklearn.exceptions.NotFittedError: idf vector is not fitted"
As far as I understand, it is trying to save the whole 'X' separately using idf_ and vocabulary_. But I just want to store the model/vectorizer (I'm not sure which term applies) so that the next time I load it, I only need to call vectorizer.fit() on the test data, without using the training data to call fit_transform() again. Is there any way to do that?
Upvotes: 0
Views: 521
Reputation: 19
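Dump the fitted vectorizer object itself, i.e. the vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(1,2)) instance after you have called vectorizer.fit_transform(df) on it, rather than only its vocabulary_. You can also dump the matrix returned by vectorizer.fit_transform(df) if you need it later. Then in Loadtfidf() load the pickle file(s) and call transform() on the new data. The NotFittedError occurs because assigning vocabulary_ to a fresh vectorizer does not restore the idf_ weights learned during fitting. This should solve your problem.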
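A minimal sketch of that approach, assuming the training and new data are plain lists of strings. The function names save_tfidf and load_tfidf and the file name vectorizer.pkl are made up for illustration; the TfidfVectorizer settings are copied from the question. It returns the sparse matrix directly instead of wrapping it in pd.SparseDataFrame, which newer pandas releases no longer provide.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer


def save_tfidf(train_texts, path="vectorizer.pkl"):
    """Fit the vectorizer on the training texts and pickle the fitted object."""
    vectorizer = TfidfVectorizer(
        min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(1, 2)
    )
    X_train = vectorizer.fit_transform(train_texts)
    # Pickle the whole fitted vectorizer so both vocabulary_ and idf_ are kept.
    with open(path, "wb") as f:
        pickle.dump(vectorizer, f)
    return X_train


def load_tfidf(new_texts, path="vectorizer.pkl"):
    """Load the fitted vectorizer and transform new texts without refitting."""
    with open(path, "rb") as f:
        vectorizer = pickle.load(f)
    # transform() reuses the vocabulary and idf weights learned during training.
    return vectorizer.transform(new_texts)
```

Call save_tfidf(train_texts) once after training, then load_tfidf(test_texts) whenever new data arrives. The returned matrix has the same number of columns as the training matrix, so it can be passed to a model trained on the latter. Note that pickles are generally only safe to load with the same scikit-learn version that created them.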
Upvotes: 0