Reputation: 3

Storing TfIdf model and then loading it to test the new dataset

I'm trying to store the TfIdf vectorizer/model (I don't know which is the right word) obtained after training on the dataset, and then load the stored model to apply it to a new dataset. The model is stored and loaded using pickle.

I have stored the TfIdf vocabulary obtained during the training phase. Then I load the stored vocabulary into a new vectorizer to transform the test data.

import pickle
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def Savetfidf(df):
    # Fit on the training data, then save only the learned vocabulary
    vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(1,2))
    X = pd.SparseDataFrame(vectorizer.fit_transform(df), columns = vectorizer.get_feature_names(), default_fill_value = 0)
    pickle.dump(vectorizer.vocabulary_, open("features.pkl", "wb"))
    return X

def Loadtfidf(df):
    # Build a fresh vectorizer and attach the saved vocabulary to it
    vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(1,2))
    vocabulary = pickle.load(open("features.pkl", "rb"))
    vectorizer.vocabulary_ = vocabulary
    X = pd.SparseDataFrame(vectorizer.transform(df), columns = vectorizer.get_feature_names(), default_fill_value = 0)
    return X

I'm getting this error:

"sklearn.exceptions.NotFittedError: idf vector is not fitted"

As far as I understand, it is trying to save the whole 'X' separately using idf_ and vocabulary_. But I just want to store the model/vectorizer (I don't know which) so that the next time I load it, I only need to call vectorizer.transform() on the test data, without using the training data to call fit_transform() again. Is there any way to do that?

Upvotes: 0

Answers (2)

Raviraj Savaliya

Reputation: 19

Dump both your model, which is the result of vectorizer.fit_transform(df), and the vectorizer itself, vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(1,2)), after fit_transform() has been called on it. Then load both pickle files in Loadtfidf(). This will solve your problem.
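
A minimal sketch of what this could look like, assuming two separate pickle files. The file names, the temporary directory, and the sample strings are made up for illustration and are not from the question:

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample data standing in for the real training and test sets
train_docs = ["hello world", "tfidf on characters", "pickle the vectorizer"]
test_docs = ["new world"]

# Hypothetical output locations
out_dir = tempfile.mkdtemp()
vec_path = os.path.join(out_dir, "vectorizer.pkl")
mat_path = os.path.join(out_dir, "train_matrix.pkl")

# Training phase: fit, then dump both the fitted vectorizer and the matrix
vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_docs)

with open(vec_path, "wb") as f:
    pickle.dump(vectorizer, f)
with open(mat_path, "wb") as f:
    pickle.dump(X_train, f)

# Loading phase: read both files back, then transform the test data
with open(vec_path, "rb") as f:
    loaded_vectorizer = pickle.load(f)
with open(mat_path, "rb") as f:
    loaded_X_train = pickle.load(f)

X_test = loaded_vectorizer.transform(test_docs)
```

The training matrix only needs to be dumped if it is used again later; transforming the test data needs just the fitted vectorizer.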

Upvotes: 0

BlackBear

Reputation: 22989

Following the instructions here, you can (un)pickle the fitted vectorizer object directly, and it will take care of correct (de)serialization on its own.
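
As a rough sketch of that idea (the sample strings are made up, and the round trip is done in memory rather than through a file): the NotFittedError in the question most likely comes from the fact that vocabulary_ alone does not carry the learned IDF weights, whereas pickling the whole fitted object keeps both.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample data standing in for the real training and new datasets
train_docs = ["store the model", "load the model", "test the new dataset"]
new_docs = ["a new dataset", "another one"]

vectorizer = TfidfVectorizer(min_df=0.0, analyzer="char", sublinear_tf=True, ngram_range=(1, 2))
vectorizer.fit(train_docs)

# Pickle the whole fitted vectorizer, not just vocabulary_
restored = pickle.loads(pickle.dumps(vectorizer))

# Both the vocabulary and the IDF weights survive the round trip
same_vocab = restored.vocabulary_ == vectorizer.vocabulary_
same_idf = np.allclose(restored.idf_, vectorizer.idf_)

# No refitting needed: transform() uses the restored vocabulary and IDF
X_new = restored.transform(new_docs)
```

On the new data, call transform() rather than fit() or fit_transform(), since fitting again would replace the vocabulary and IDF weights learned from the training set. A pickle is generally best loaded with the same scikit-learn version that created it.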

Upvotes: 0
