Tasos

Reputation: 7587

Make predictions from a saved trained classifier in Scikit Learn

I wrote a classifier for Tweets in Python and saved it to disk in .pkl format, so I can run it again and again without having to train it each time. This is the code:

import pandas
import re
from sklearn.feature_extraction import FeatureHasher

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

from sklearn import cross_validation

from sklearn.externals import joblib


#read the dataset of tweets

header_row=['sentiment','tweetid','date','query', 'user', 'text']
train = pandas.read_csv("training.data.csv",names=header_row)

#keep only the right columns

train = train[["sentiment","text"]]

#remove punctuation, special characters and numbers, and lower case the text

def remove_spch(text):
    return re.sub("[^a-z]", ' ', text.lower())

train['text'] = train['text'].apply(remove_spch)


#Feature Hashing

def tokens(doc):
    """Extract tokens from doc.

    This uses a simple regex to break strings into tokens.
    """
    return (tok.lower() for tok in re.findall(r"\w+", doc))

n_features = 2**18
hasher = FeatureHasher(n_features=n_features, input_type="string", non_negative=True)
X = hasher.transform(tokens(d) for d in train['text'])

y = train['sentiment']

X_new = SelectKBest(chi2, k=20000).fit_transform(X, y)

a_train, a_test, b_train, b_test = cross_validation.train_test_split(X_new, y, test_size=0.2, random_state=42)

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(a_train.toarray(), b_train)
prediction = classifier.predict(a_test.toarray())

#Export the trained model to load it in another project

joblib.dump(classifier, 'my_model.pkl', compress=9)

Let's say I have another Python file and I want to classify a new Tweet. How do I go about the classification?

from sklearn.externals import joblib
model_clone = joblib.load('my_model.pkl')

mytweet = 'Uh wow:@medium is doing a crowdsourced data-driven investigation tracking down a disappeared refugee boat'
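From here I can tokenize and hash the new tweet exactly as in training, something like this (tokens and remove_spch are the same helpers as in the training script):

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=2**18, input_type="string", non_negative=True)
tweet_hashed = hasher.transform([list(tokens(remove_spch(mytweet)))])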

Up to hasher.transform I can replicate the same procedure, but then I hit a problem: I cannot reduce the hashed tweet to the best 20,000 features. To use SelectKBest you need both the features and the label, and since the label is exactly what I want to predict, I cannot use SelectKBest here. So how can I get past this issue and continue with the prediction?

Upvotes: 3

Views: 3735

Answers (1)

lanenok

Reputation: 2749

I support the comment of @EdChum that

you build a model by training it on data which presumably is representative enough for it to cope with unseen data

In practice this means that at prediction time you need to apply both FeatureHasher and SelectKBest to your new data with transform only. (It is wrong to fit these transformers anew on the new data, because in general that will produce different features.)
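As a rough illustration, assuming the FeatureHasher, the fitted SelectKBest and the classifier from training are available in the prediction script (for example loaded from pickles, as described below), and reusing the tokens and remove_spch helpers from the question:

# use a hasher with the same parameters as in training, and the
# SelectKBest/classifier objects that were fitted during training
X = hasher.transform([list(tokens(remove_spch(mytweet)))])
X = selector.transform(X)                 # transform only, never fit on new data
prediction = classifier.predict(X.toarray())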

To make these objects available in the prediction script, either

  • pickle FeatureHasher and SelectKBest separately

or (better)

  • make a Pipeline of FeatureHasher, SelectKBest, and RandomForestClassifier and pickle the whole pipeline. Then you can load this pipeline and call predict on new data (see the sketch below).
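A minimal sketch of the pipeline approach, reusing the tokens and remove_spch helpers and the parameters from the question (the step names and the file name my_pipeline.pkl are just illustrative):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib

# FeatureHasher expects each sample as an iterable of tokens,
# so tokenize the cleaned tweets up front
X_tokens = [list(tokens(remove_spch(t))) for t in train['text']]
y = train['sentiment']

pipeline = Pipeline([
    ('hasher', FeatureHasher(n_features=2**18, input_type='string', non_negative=True)),
    ('select', SelectKBest(chi2, k=20000)),
    ('forest', RandomForestClassifier(n_estimators=10)),
])
pipeline.fit(X_tokens, y)

# persist the whole fitted pipeline, not just the classifier
joblib.dump(pipeline, 'my_pipeline.pkl', compress=9)

In the second script you load the pipeline and call predict; the hashing and the feature selection learned during training are applied to the new tweet automatically:

from sklearn.externals import joblib

pipeline = joblib.load('my_pipeline.pkl')
prediction = pipeline.predict([list(tokens(remove_spch(mytweet)))])

Note that recent versions of RandomForestClassifier accept sparse input; with an older scikit-learn that does not, you would need to densify between SelectKBest and the forest, as the original code does with .toarray().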

Upvotes: 5
