Reputation: 1140
I'm in the process of writing a naive Bayes classifier because I have a large group of text documents that I need to classify. However, when I try to test my predictions, I get the following error:
sklearn.utils.validation.NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.
What I did before asking here
I'm aware of the theory of how naive Bayes classification works:
                       P(B|A) * P(A)
P(A|B) = -----------------------------------------------
         P(B|A)*P(A) + P(B|C)*P(C) + ... + P(B|n)*P(n)
where A through n are the distinct classes you want to classify, P(B|A) is the probability of B occurring given that A has occurred, and P(A) is the probability of A occurring. It should be noted that I'm specifically working with multinomial naive Bayes.
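To make the formula concrete, here's a quick toy computation with two classes A and C and made-up probabilities (the numbers are illustrative only):

    # Toy Bayes-rule check: priors P(A)=0.6, P(C)=0.4,
    # likelihoods P(B|A)=0.2, P(B|C)=0.5
    p_b_given_a, p_a = 0.2, 0.6
    p_b_given_c, p_c = 0.5, 0.4

    numerator = p_b_given_a * p_a                # 0.12
    denominator = numerator + p_b_given_c * p_c  # 0.12 + 0.20 = 0.32
    print(numerator / denominator)               # P(A|B) = 0.375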
I also found this question:
SciPy and scikit-learn - ValueError: Dimension mismatch
as well as this question
cannot cast array data when a saved classifier is called
However, I'm still having problems when I try to make predictions or test my predictions.
I've written the following function to create a training and a testing set:
from random import randrange

def split_data_set(original_data_set, percentage):
    test_set = []
    train_set = []
    forbidden = set()
    split_sets = {}
    # is_float is a small helper of mine defined elsewhere
    if is_float(percentage):
        stop_len = int(percentage * len(original_data_set))
        # draw random indices for the training set until it is big enough
        while len(train_set) < stop_len:
            random_selection = randrange(0, len(original_data_set))
            if random_selection not in forbidden:
                forbidden.add(random_selection)
                train_set.append(original_data_set[random_selection])
        # everything not drawn goes into the test set; the range has to
        # cover the whole set (not len - 1), otherwise the last document
        # is silently dropped
        for j in range(len(original_data_set)):
            if j not in forbidden:
                test_set.append(original_data_set[j])
        split_sets.update({'testing set': test_set})
        split_sets.update({'training set': train_set})
        split_sets.update({'forbidden': forbidden})
    return split_sets
Then I create and train a model:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def create_and_fit_baes_model(data_set):
    train = []
    expect = []
    # each entry in the data set is a (label, text) pair
    for data in data_set['training set']:
        train.append(data[1])
        expect.append(data[0])
    vectorizer = TfidfVectorizer(min_df=1)
    # I think this is one of the places where I'm doing something
    # incorrectly
    vectorized_training_data = vectorizer.fit_transform(train)
    model = MultinomialNB()
    model.fit(vectorized_training_data, expect)
    return model
And to test my model:
from sklearn import metrics

def test_nb_model(data_set, model):
    test = []
    expect = []
    for data in data_set['testing set']:
        test.append(data[1])
        expect.append(data[0])
    # This is the other section where I think that
    # I'm doing something incorrectly
    vectorizer = TfidfVectorizer(min_df=1)
    vectorized_testing_data = vectorizer.transform(test)
    fitted_vectorized_testing_data = vectorizer.fit(vectorized_testing_data)
    predicted = model.predict(fitted_vectorized_testing_data)
    print(metrics.confusion_matrix(expect, predicted))
    print(metrics.classification_report(expect, predicted))
I believe that I'm having a problem during the transformation/fitting stage.
I know that tf-idf vectorization works as follows: you start with a regular document-term matrix holding the counts for the different terms in each document.
        term1   term2   term3   termn
doc1      5       0      13       1
doc2      0       8       2       0
doc3      1       5       5      10
 ...     ...     ...     ...     ...
docn     10       0       0       0
From here you apply a weighting scheme to determine how important specific words are to your corpus, as in the sketch below.
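For example, a minimal toy run of the vectorizer (the documents are made up) shows the weighted matrix this produces:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat", "the dog sat", "the cat ran"]  # made-up corpus
    vectorizer = TfidfVectorizer(min_df=1)
    weighted = vectorizer.fit_transform(docs)  # learns the vocabulary AND applies the weights
    print(vectorizer.vocabulary_)              # term -> column index, learned during fit
    print(weighted.toarray())                  # one row of tf-idf weights per document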
I know how all of this works in theory and I can work the math out on paper, but when I try reading the documentation for sklearn I'm still a little confused as to how I'm supposed to code everything.
I've been struggling with this for the past two days. If someone could provide some insight into what I'm doing wrong and how I can fully train and run my model I'd appreciate it.
Upvotes: 0
Views: 2272
Reputation: 36545
I think the cleanest option is to use a Pipeline to package your vectorizer with your classifier; then when you call model.fit, it fits the vocabulary and term frequencies of your vectorizer and makes them available to the later steps. This way you can still return a single "model" from your training function, and you can also pickle it if you need to save your model.
from sklearn.pipeline import Pipeline

def create_and_fit_model(data):
    # ... get your train and expect data
    vectorizer = TfidfVectorizer(min_df=1)
    nb = MultinomialNB()
    model = Pipeline([('vectorizer', vectorizer), ('nb', nb)])
    model.fit(train, expect)
    return model
By the way, you don't need to write your own code for the train/test split; you can use sklearn.cross_validation.train_test_split (see the sketch below). Also, you should look at using pandas for storing your data rather than plain lists; it will make it easier to extract columns.
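For example, a minimal sketch, assuming data is a list of (label, text) pairs like the ones your functions consume:

    from sklearn.cross_validation import train_test_split

    labels = [d[0] for d in data]  # assuming (label, text) pairs
    texts = [d[1] for d in data]
    train_texts, test_texts, train_expect, test_expect = train_test_split(
        texts, labels, test_size=0.2, random_state=42)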
Upvotes: 2