Reputation: 23
I'm developing a simple text classifier for SMS messages, and the full model has 3 steps.
I wanted to merge the 3 steps into one model using sklearn.pipeline and save the model using joblib.dump. The problem is that when I load the saved model, the output is the same every time: for any test or training data from the spam class, I get ham!
This is the custom function I used before the Pipeline:
def TextCleaning(X):
    documents = []
    for sent in X:
        # Remove all single characters
        sent = re.sub(r'\s+[a-zA-Z]\s+', ' ', sent)
        # Substituting multiple spaces with single space
        sent = re.sub(r'\s+', ' ', sent, flags=re.I)
        doc = nlp(sent)
        document = [token.lemma_ for token in doc]
        document = ' '.join(document)
        documents.append(document)
    return documents
This is the code for TextCleaning as a class for the Pipeline:
class TextCleaning():
    def __init__(self):
        print("call init")

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        documents = []
        for sent in X:
            # Remove all single characters
            sent = re.sub(r'\s+[a-zA-Z]\s+', ' ', sent)
            # Substituting multiple spaces with single space
            sent = re.sub(r'\s+', ' ', sent, flags=re.I)
            doc = nlp(sent)
            document = [token.lemma_ for token in doc]
            document = ' '.join(document)
            documents.append(document)
        return documents
This is the Pipeline code:
EmailClassification = Pipeline([('TextCleaning', TextCleaning()),
                                ('Vectorizer', TfidfVectorizer(stop_words=STOP_WORDS)),
                                ('NB', MultinomialNB())])
The full notebook and data are on GitHub: Ham-or-Spam-SMS-Classification
Upvotes: 1
Views: 195
Reputation: 46908
In your notebook, you are doing:
EmailClassification.predict("Congratulations, you won @ free rolex")
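If you pass your data as a single string, the Pipeline treats it as a sequence of characters: the for sent in X loop in TextCleaning.transform iterates over the string one character at a time, so each character is classified as its own message. That is why you get as many predictions as there are characters in your string: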
EmailClassification.predict("Congratulations, you won @ free rolex")
array(['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
       'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
       'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
       'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
       'ham'], dtype='<U4')
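The one-prediction-per-character effect can be reproduced without scikit-learn or spaCy. The sketch below uses a hypothetical stand-in, iterate_like_transform, that only mimics the loop shape of TextCleaning.transform (one output per item in X); it is not the real cleaning step.

```python
def iterate_like_transform(X):
    # Hypothetical stand-in for TextCleaning.transform:
    # it produces exactly one output entry per item it iterates over.
    return [item for item in X]


message = "Congratulations, you won @ free rolex"

# A bare string is iterated character by character.
as_string = iterate_like_transform(message)

# A list containing one string is iterated message by message.
as_list = iterate_like_transform([message])

print(len(as_string))  # 37, one entry per character
print(len(as_list))    # 1, one entry per message
```

The 37 entries match the 37 'ham' predictions in the output above.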
It should be:
EmailClassification.predict(["Congratulations, you won @ free rolex"])
array(['spam'], dtype='<U4')
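If you want the pipeline to tolerate a bare string, one option is to normalise the input at the start of transform. This is only a sketch: as_documents is a hypothetical helper name, not part of scikit-learn, and wrapping silently is a design choice (raising a TypeError on a bare string would be the stricter alternative).

```python
def as_documents(X):
    """Return X as a list of documents, wrapping a bare string in a list."""
    if isinstance(X, str):
        # A single message was passed directly; treat it as one document.
        return [X]
    # Any other iterable (list, tuple, pandas Series, ...) is copied to a list.
    return list(X)


print(as_documents("Congratulations, you won @ free rolex"))
print(as_documents(["first message", "second message"]))
```

Inside TextCleaning.transform you would then loop with for sent in as_documents(X) instead of for sent in X.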
Upvotes: 1