j.jerrod.taylor

Reputation: 1140

How do I correctly fit and transform the TF-IDF values for my text classifier?

I'm in the process of writing a naive Bayes classifier because I have a large group of text documents that I need to classify. However, when I try to test my predictions I get the following error:

sklearn.utils.validation.NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.

What I did before asking here

I'm aware of the theory of how naive bayes classification works.

                        P(B|A) * P(A)
  P(A|B) = -------------------------------------------------
           P(B|A) * P(A) + P(B|C) * P(C) + ... + P(B|n) * P(n)

where A through n are the distinct classes you want to classify, P(B|A) is the probability of B occurring given that A has occurred, and P(A) is the probability of A occurring. It should be noted that I'm specifically working with multinomial naive Bayes.
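To make sure I have the arithmetic straight, here is a tiny sketch of that calculation with made-up numbers (two classes A and C, one observed term B):

# toy posterior calculation for the formula above (numbers are made up)
p_b_given_a, p_a = 0.20, 0.60   # P(B|A), P(A)
p_b_given_c, p_c = 0.05, 0.40   # P(B|C), P(C)

evidence = p_b_given_a * p_a + p_b_given_c * p_c   # denominator: total probability of B
p_a_given_b = (p_b_given_a * p_a) / evidence       # Bayes' rule
print(p_a_given_b)  # about 0.857, so A is the more likely class given B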

I also found this question:

SciPy and scikit-learn - ValueError: Dimension mismatch

as well as this question

cannot cast array data when a saved classifier is called

However, I'm still having problems when I try to make predictions or test my predictions.

I've written the following functions that I use to create a training and a testing set

from random import randrange

def split_data_set(original_data_set, percentage):
    test_set = []
    train_set = []

    forbidden = set()

    split_sets = {}

    # is_float is a small helper of mine that validates the percentage argument
    if is_float(percentage):
        stop_len = int(percentage * len(original_data_set))

    # randomly pick documents for the training set until it reaches the target size
    while len(train_set) < stop_len:
        random_selection = randrange(0, len(original_data_set))
        if random_selection not in forbidden:
            forbidden.add(random_selection)
            train_set.append(original_data_set[random_selection])

    # everything that wasn't picked for training goes into the test set
    for j in range(len(original_data_set)):
        if j not in forbidden:
            test_set.append(original_data_set[j])

    split_sets.update({'testing set': test_set})
    split_sets.update({'training set': train_set})
    split_sets.update({'forbidden': forbidden})

    return split_sets

create and train a model

def create_and_fit_baes_model(data_set):

    train = []
    expect = []

    for data in data_set['training set']:
        train.append(data[1])
        expect.append(data[0])

    vectorizer = TfidfVectorizer(min_df=1)

    # I think this is one of the places where I'm doing something 
    # incorrectly
    vectorized_training_data = vectorizer.fit_transform(train)

    model = MultinomialNB()
    model.fit(vectorized_training_data, expect)

    return model

and to test my model

def test_nb_model(data_set, model):

    test = []
    expect = []

    for data in data_set['testing set']:
        test.append(data[1])
        expect.append(data[0])

    #This is the other section where I think that 
    # I'm doing something incorrectly
    vectorizer = TfidfVectorizer(min_df=1)
    vectorized_testing_data = vectorizer.transform(test)
    fitted_vectorized_testing_data = vectorizer.fit(vectorized_testing_data)

    predicted = model.predict(fitted_vectorized_testing_data)

    print(metrics.confusion_matrix(expect,predicted))
    print(metrics.classification_report(expect, predicted))

I believe that I'm having a problem during the transformation/fitting stage.

I know that TF-IDF vectorization works as follows. You start with a regular term-count matrix, where each row is a document and each column holds the counts for a different term:

     | term1 | term2 | term3 | termn
doc1 |   5   |   0   |  13   |   1
doc2 |   0   |   8   |   2   |   0
doc3 |   1   |   5   |   5   |  10
 .   |   .   |   .   |   .   |   .
docn |  10   |   0   |   0   |   0

From here you apply a weighting scheme to determine how important specific words are to your corpus.
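As a sanity check on my understanding, here is a small sketch (toy documents, not my real data) of how that counting and weighting maps onto sklearn:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# toy corpus standing in for my documents
docs = ["the cat sat", "the dog sat", "the cat ran"]

# step 1: build the raw term-count matrix like the one above
counts = CountVectorizer().fit_transform(docs)
print(counts.toarray())

# step 2: apply the TF-IDF weighting scheme to those counts
weights = TfidfTransformer().fit_transform(counts)
print(weights.toarray().round(2))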

I know how all of this works in theory and I can work the math out on paper, but when I try reading the documentation for sklearn I'm still a little confused as to how I'm supposed to code everything.

I've been struggling with this for the past two days. If someone could provide some insight into what I'm doing wrong and how I can fully train and run my model I'd appreciate it.

Upvotes: 0

Views: 2272

Answers (1)

maxymoo

Reputation: 36545

I think the cleanest option is to use a Pipeline to package your vectorizer with your classifier; then when you call model.fit, it fits the vocabulary and term frequencies of your vectorizer and makes them available to later calls. This way you can still return a single "model" from your training function, and you can also pickle it if you need to save your model.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def create_and_fit_model(data):
    # ... get your train and expect data
    vectorizer = TfidfVectorizer(min_df=1)
    nb = MultinomialNB()
    model = Pipeline([('vectorizer', vectorizer), ('nb', nb)])
    model.fit(train, expect)
    return model
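Then prediction and saving would look something like this (assuming test and expect are raw document strings and labels, as in your test_nb_model function):

import pickle

# the pipeline re-uses the vocabulary fitted during training,
# so you can pass raw text straight to predict
predicted = model.predict(test)
print(metrics.classification_report(expect, predicted))

# pickling the pipeline saves the fitted vectorizer and classifier together
with open('nb_model.pkl', 'wb') as f:
    pickle.dump(model, f)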

By the way, you don't need to write your own code for the train/test split; you can use sklearn.cross_validation.train_test_split. Also, consider using pandas for storing your data rather than plain lists; it will make it easier to extract columns.
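For example, something along these lines (assuming your original data set is a list of (label, text) pairs, as your split function suggests):

from sklearn.cross_validation import train_test_split

labels = [d[0] for d in original_data_set]
texts = [d[1] for d in original_data_set]

# hold out 25% of the documents for testing
train, test, expect_train, expect_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)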

Upvotes: 2
