Classifying text with scikit

Question

I'm learning Scikit machine-learning for a project and while I'm beginning to grasp the general process the details are a bit fuzzy still.

Earlier I managed to build a classifier, train it and test it with test set. I saved it to disk with cPickle. Now I want to create a class which loads this classifier and lets user to classify single tweets with it.

I thought this would be trivial but I seem to get ValueError('dimension mismatch') from X_new_tfidf = self.tfidf_transformer.fit_transform(fitTweetVec) line with following code:

class TweetClassifier:

classifier = None
vect = TfidfVectorizer()
tfidf_transformer = TfidfTransformer()

#open the classifier saved to disk to be utilized later
def openClassifier(self, name):
    with open(name+'.pkl', 'rb') as fid:
        return cPickle.load(fid)

def __init__(self, classifierName):
    self.classifier = self.openClassifier(classifierName)
    self.classifyTweet(np.array([u"Helvetin vittu miksi aina pitää sataa vettä???"]))

def classifyTweet(self, tweetText):

    fitTweetVec = self.vect.fit_transform(tweetText)
    print self.vect.get_feature_names()
    X_new_tfidf = self.tfidf_transformer.fit_transform(fitTweetVec)
    print self.classifier.predict(X_new_tfidf)

What I'm doing wrong here? I used similar code while I made the classifier and ran test set for it. Have I forgotten some important step here?

Now I admit that I don't fully understand yet the fitting and transforming here since I found the Scikit's tutorial a bit ambiguous about it. If someone knows an as clear explanation of them as possible, I'm all for links :)

elyase · Accepted Answer

The problem is that your classifier was trained with a fixed number of features (the length of the vocabulary of your previous data) and now when you fit_transform the new tweet, the TfidfTransformer will produce a new vocabulary and a new number of features, and will represent the new tweet in this space.

The solution is to also save the previously fitted TfidfTransformer (which contains the old vocabulary), load it with the classifier and .transform (not fit_transform because it was already fitted to the old data) the new tweet in this same representation.

You can also use a Pipeline that contains both the TfidfTransformer and the Classifier and pickle the Pipeline, this is easier and recommended.

Classifying text with scikit

Answers (1)

Related Questions