thrinadhn
Reputation: 2503

Loading pickle NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted

multilabel classification

I am trying to do multilabel classification using scikit-learn, pandas, OneVsRestClassifier and logistic regression. Building and evaluating the model works, but classifying new sample text does not.

scenario 1:

After building and evaluating the model, I save it as sample.pkl and restart my kernel. When I then load the saved model (sample.pkl) and run a prediction on sample text, I get this error:

 NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.

inference

import os
import pickle
import collections
import csv
import json
import re
from collections import Counter

import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
from nltk.corpus import stopwords
from sklearn.metrics import f1_score  # performance metric
from sklearn.multiclass import OneVsRestClassifier  # binary relevance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

stop_words = set(stopwords.words('english'))

def cleanHtml(sentence):
    """Remove HTML tags."""
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', str(sentence))
    return cleantext


def cleanPunc(sentence):
    """Strip punctuation and special characters from the sentence."""
    cleaned = re.sub(r'[?|!|\'|"|#]', r'', sentence)
    cleaned = re.sub(r'[.|,|)|(|\|/]', r' ', cleaned)
    cleaned = cleaned.strip()
    cleaned = cleaned.replace("\n", " ")
    return cleaned


def keepAlpha(sentence):
    """Keep only alphabetic characters in each word."""
    alpha_sent = ""
    for word in sentence.split():
        alpha_word = re.sub('[^a-z A-Z]+', ' ', word)
        alpha_sent += alpha_word
        alpha_sent += " "
    alpha_sent = alpha_sent.strip()
    return alpha_sent


def remove_stopwords(text):
    """Remove English stop words."""
    no_stopword_text = [w for w in text.split() if w not in stop_words]
    return ' '.join(no_stopword_text)

test1 = pd.read_csv("C:\\Users\\abc\\Downloads\\test1.csv")
test1.columns

test1.head()
siNo  plot                              movie_name       genre_new
1     The story begins with Hannah...   sing             [drama,teen]
2     Debbie's favorite band is Dream.. the bigeest fan  [drama]
3     This story of a Zulu family is .. come back,africa [drama,Documentary]

I get the error here, when I run inference on sample text:

def infer_tags(q):
    q = cleanHtml(q)
    q = cleanPunc(q)
    q = keepAlpha(q)
    q = remove_stopwords(q)
    multilabel_binarizer = MultiLabelBinarizer()
    tfidf_vectorizer = TfidfVectorizer()
    q_vec = tfidf_vectorizer.transform([q])
    q_pred = clf.predict(q_vec)
    return multilabel_binarizer.inverse_transform(q_pred)


for i in range(5):
    print(i)
    k = test1.sample(1).index[0] 
    print("Movie: ", test1['movie_name'][k], "\nPredicted genre: ", infer_tags(test1['plot'][k]))
    print("Actual genre: ", test1['genre_new'][k], "\n")

solved

I solved this by also saving the fitted TfidfVectorizer and MultiLabelBinarizer to pickle files and loading them at inference time:

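# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
import joblib
# after training: save the fitted vectorizer and binarizer
joblib.dump(tfidf_vectorizer, "tfidf_vectorizer.pickle")
joblib.dump(multilabel_binarizer, "multibinirizer_vectorizer.pickle")
# at inference time, after restarting the kernel: load them back
vectorizer = joblib.load('/abc/downloads/tfidf_vectorizer.pickle')
multilabel_binarizer = joblib.load('/abc/downloads/multibinirizer_vectorizer.pickle')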


def infer_tags(q):
    q = cleanHtml(q)
    q = cleanPunc(q)
    q = keepAlpha(q)      
    q = remove_stopwords(q)
    q_vec = vectorizer.transform([q])
    q_pred = rf_model.predict(q_vec)
    return multilabel_binarizer.inverse_transform(q_pred)

I found the solution in the question linked below: https://stackoverflow.com/questions/32764991/how-do-i-store-a-tfidfvectorizer-for-future-use-in-scikit-learn

Upvotes: 3

Views: 1234

Answers (1)

0x5050
Reputation: 1231

This happens because you are only dumping the classifier into the pickle and not the vectorizer.

During inference, when you call

 tfidf_vectorizer = TfidfVectorizer()

you create a brand-new vectorizer that has never been fitted on the training vocabulary, so calling transform on it raises the error.
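A minimal sketch of the failure in isolation, assuming scikit-learn is installed. The variable name `fresh_vectorizer` and the sample string are made up for illustration, and the exact error message text differs between scikit-learn versions:

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

# A newly constructed vectorizer has learned no vocabulary yet.
fresh_vectorizer = TfidfVectorizer()

raised = False
try:
    # transform() needs the vocabulary that fit()/fit_transform() would have learned.
    fresh_vectorizer.transform(["some new plot text"])
except NotFittedError as err:
    raised = True
    print("NotFittedError:", err)
```

Restarting the kernel has the same effect: the fitted object is gone unless it was saved to disk.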

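What you should do is dump both the classifier and the vectorizer to pickle, then load them both during inference. The same applies to the MultiLabelBinarizer, since inverse_transform also needs the fitted object.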
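One way this could look, as a rough sketch rather than the asker's actual training code. The toy plots and genres, the file name `model_bundle.pkl`, and the choice to store all three fitted objects in a single dictionary are assumptions made for illustration; saving them to separate files works just as well.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training data (hypothetical, only to make the sketch runnable).
plots = [
    "a teenage girl struggles through high school drama",
    "a documentary about a family in africa",
    "teen friends follow their favorite band",
    "a documentary film crew records village life",
]
genres = [["drama", "teen"], ["documentary"], ["drama", "teen"], ["documentary"]]

# Training session: fit the vectorizer, the label binarizer and the classifier.
tfidf_vectorizer = TfidfVectorizer()
multilabel_binarizer = MultiLabelBinarizer()
X = tfidf_vectorizer.fit_transform(plots)
y = multilabel_binarizer.fit_transform(genres)
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# Save all three fitted objects, not just the classifier.
model_path = os.path.join(tempfile.mkdtemp(), "model_bundle.pkl")
with open(model_path, "wb") as f:
    pickle.dump(
        {"vectorizer": tfidf_vectorizer,
         "binarizer": multilabel_binarizer,
         "classifier": clf},
        f,
    )

# Inference session (for example after a kernel restart): load, never re-create.
with open(model_path, "rb") as f:
    bundle = pickle.load(f)


def infer_tags(text):
    """Predict genre tuples for one text using the loaded, already fitted objects."""
    q_vec = bundle["vectorizer"].transform([text])
    q_pred = bundle["classifier"].predict(q_vec)
    return bundle["binarizer"].inverse_transform(q_pred)


sample_text = "a documentary about village life in africa"
predicted = infer_tags(sample_text)
print(predicted)
```

The point of the design is that `transform` and `inverse_transform` are only ever called on objects that went through `fit` during training; nothing is constructed anew at inference time.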

Upvotes: 1