Reputation: 7587
I wrote a classifier for Tweets in Python and saved it to disk in .pkl
format, so I can run it again and again without having to train it each time. This is the code:
import pandas
import re
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn import cross_validation
from sklearn.externals import joblib
#read the dataset of tweets
header_row=['sentiment','tweetid','date','query', 'user', 'text']
train = pandas.read_csv("training.data.csv",names=header_row)
#keep only the right columns
train = train[["sentiment","text"]]
#remove punctuation, special characters, numbers and lower case the text
def remove_spch(text):
    return re.sub("[^a-z]", ' ', text.lower())
train['text'] = train['text'].apply(remove_spch)
#Feature Hashing
def tokens(doc):
    """Extract tokens from doc.
    This uses a simple regex to break strings into tokens.
    """
    return (tok.lower() for tok in re.findall(r"\w+", doc))
n_features = 2**18
hasher = FeatureHasher(n_features=n_features, input_type="string", non_negative=True)
X = hasher.transform(tokens(d) for d in train['text'])
y = train['sentiment']
X_new = SelectKBest(chi2, k=20000).fit_transform(X, y)
a_train, a_test, b_train, b_test = cross_validation.train_test_split(X_new, y, test_size=0.2, random_state=42)
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=10)
classifier.fit(a_train.toarray(), b_train)
prediction = classifier.predict(a_test.toarray())
#Export the trained model to load it in another project
joblib.dump(classifier, 'my_model.pkl', compress=9)
Let's say that I have another Python file and I want to classify a Tweet. How can I proceed to do the classification?
from sklearn.externals import joblib
model_clone = joblib.load('my_model.pkl')
mytweet = 'Uh wow:@medium is doing a crowdsourced data-driven investigation tracking down a disappeared refugee boat'
Up to the hasher.transform step I can replicate the same procedure on the new tweet, but then I have the problem that I cannot calculate the best 20k features. To fit SelectKBest, you need to pass both features and labels. Since I want to predict the label, I cannot fit SelectKBest on the new tweet. So, how can I get past this issue and continue with the prediction?
Upvotes: 3
Views: 3735
Reputation: 2749
I support the comment of @EdChum that
"you build a model by training it on data which presumably is representative enough for it to cope with unseen data"
Practically this means that you need to apply both FeatureHasher and SelectKBest to your new data with transform only, without fitting them again. (It is wrong to fit SelectKBest anew on the new data, because in general it will select different features. FeatureHasher is stateless, so it only has to be configured with the same n_features.)
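As a minimal sketch of that idea, assuming a recent scikit-learn and a made-up toy dataset: the selector is fitted once with labels at training time, and afterwards its transform method needs no labels at all. The texts, labels, n_features and k below are illustrative only, and alternate_sign=False is used here in place of the question's non_negative=True, which newer releases no longer accept.

```python
import re

from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest, chi2


def tokens(doc):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", doc.lower())


# Toy training data (hypothetical, stands in for the real tweet corpus).
train_texts = [
    "i love this phone",
    "great happy day",
    "i hate this traffic",
    "awful sad day",
]
train_labels = [1, 1, 0, 0]

# alternate_sign=False keeps the hashed counts non-negative, as chi2 requires.
hasher = FeatureHasher(n_features=2**10, input_type="string", alternate_sign=False)
selector = SelectKBest(chi2, k=5)

# Training time: the selector sees features AND labels.
X_train = hasher.transform([tokens(t) for t in train_texts])
X_train_sel = selector.fit_transform(X_train, train_labels)

# Prediction time: reuse the fitted selector, no labels involved.
X_new = hasher.transform([tokens("what a great day")])
X_new_sel = selector.transform(X_new)

print(X_train_sel.shape, X_new_sel.shape)
```

The new tweet ends up with the same k columns that were chosen during training, which is what the classifier expects.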
To do this, either pickle FeatureHasher and SelectKBest separately or (better) make a Pipeline of FeatureHasher, SelectKBest, and RandomForestClassifier and pickle the whole pipeline. Then you can load this pipeline and use predict on new data.
Upvotes: 5
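The Pipeline option could look roughly like the sketch below. It assumes a recent scikit-learn with the standalone joblib package, uses a tiny made-up training set, and writes to a temporary directory; the step names, file name, n_features and k are all illustrative choices, not values from the original post. Tokenization is kept outside the pipeline, so the tokens helper has to be available in the other file as well.

```python
import os
import re
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline


def tokens(doc):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", doc.lower())


# Toy training data (hypothetical, stands in for the real tweet corpus).
train_texts = [
    "i love this phone",
    "great happy day",
    "i hate this traffic",
    "awful sad day",
]
train_labels = [1, 1, 0, 0]

pipeline = Pipeline([
    # alternate_sign=False keeps hashed counts non-negative for chi2.
    ("hasher", FeatureHasher(n_features=2**10, input_type="string",
                             alternate_sign=False)),
    ("select", SelectKBest(chi2, k=5)),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# One fit call trains the selector and the classifier together.
pipeline.fit([tokens(t) for t in train_texts], train_labels)

# Persist the whole pipeline, not just the classifier.
model_path = os.path.join(tempfile.mkdtemp(), "my_pipeline.pkl")
joblib.dump(pipeline, model_path, compress=9)

# In the other file: load the pipeline and predict on a raw tweet.
loaded = joblib.load(model_path)
mytweet = ("Uh wow:@medium is doing a crowdsourced data-driven "
           "investigation tracking down a disappeared refugee boat")
prediction = loaded.predict([tokens(mytweet)])
print(prediction)
```

Because the fitted SelectKBest travels inside the pickled pipeline, the loading side never has to recompute the best features.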