Abdelrahman Abozied

Reputation: 23

Sklearn Pipeline and original model don't give the same answer (fixed output)

I'm developing a simple text classifier for SMS messages, and the full model has 3 steps:

  1. TextCleaning() "Custom function"
  2. TfidfVectorizer() "Vectorizer"
  3. MultinomialNB() "Classification model"

I wanted to merge the 3 steps into one model using sklearn.pipeline and save it with joblib.dump. The problem is that when I load the saved model, the output is always the same: for any test or training example of the spam class, I get ham!

This is the custom function before the Pipeline:

def TextCleaning(X):
    documents = []
    
    for sent in X:
        # Remove all single characters
        sent = re.sub(r'\s+[a-zA-Z]\s+', ' ', sent)
        
        # Substituting multiple spaces with single space
        sent = re.sub(r'\s+', ' ', sent)
        
        doc = nlp(sent)
        
        document = [token.lemma_ for token in doc]
        
        document = ' '.join(document)
        
        documents.append(document)
    return documents

This is the code of TextCleaning as a class for the Pipeline:

class TextCleaning():
    def __init__(self):
        print("call init")
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        documents = []
        for sent in X:
            # Remove all single characters
            sent = re.sub(r'\s+[a-zA-Z]\s+', ' ', sent)

            # Substituting multiple spaces with single space
            sent = re.sub(r'\s+', ' ', sent)

            doc = nlp(sent)

            document = [token.lemma_ for token in doc]

            document = ' '.join(document)

            documents.append(document)
            
        return documents

This is the Pipeline code:

EmailClassification = Pipeline([('TextCleaning', TextCleaning()),
                                ('Vectorizer', TfidfVectorizer(stop_words=STOP_WORDS)),
                                ('NB', MultinomialNB())])
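For reference, a minimal sketch of the save/load round trip described above. It is a stand-in, not the notebook's code: the toy sentences and the file name sms_model.joblib are invented for illustration, and the TextCleaning step and STOP_WORDS are left out so the example runs without spaCy. A loaded pipeline should predict exactly what the original one does.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data invented for illustration; not the SMS dataset from the notebook.
texts = [
    "win a free prize now",
    "claim your free cash prize now",
    "are we meeting for lunch today",
    "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]

# Two-step stand-in for the three-step pipeline (custom cleaning step omitted).
model = Pipeline([('Vectorizer', TfidfVectorizer()),
                  ('NB', MultinomialNB())])
model.fit(texts, labels)

# Save to a temporary location and load it back.
path = os.path.join(tempfile.mkdtemp(), 'sms_model.joblib')
joblib.dump(model, path)
loaded = joblib.load(path)

# Note: predict() receives a list of documents, one string per message.
samples = ["free prize now", "see you at lunch"]
original_pred = list(model.predict(samples))
loaded_pred = list(loaded.predict(samples))
print(original_pred, loaded_pred)
```

If a custom step such as TextCleaning is part of the saved pipeline, its class has to be importable in the process that calls joblib.load, because the file stores a reference to the class rather than its source.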

The full notebook and data are on GitHub: Ham-or-Spam-SMS-Classification

Upvotes: 1

Views: 195

Answers (1)

StupidWolf

Reputation: 46908

In your notebook, you are doing:

EmailClassification.predict("Congratulations, you won @ free rolex")

If you pass your data as a bare string, the Pipeline iterates over it character by character (the "for sent in X" loop in TextCleaning.transform treats each character as a separate document), so you get as many predictions as there are characters in the string:

EmailClassification.predict("Congratulations, you won @ free rolex")
array(['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
       'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
       'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
       'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham',
       'ham'], dtype='<U4')
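The effect can be reproduced without the model. The sketch below uses a simplified stand-in for the transform loop (the name clean is hypothetical, and the spaCy lemmatization is omitted) to show how many "documents" come out for a string versus a one-element list.

```python
import re

def clean(X):
    # Same loop shape as TextCleaning.transform, minus the spaCy lemmatization.
    documents = []
    for sent in X:
        sent = re.sub(r'\s+', ' ', sent)
        documents.append(sent)
    return documents

text = "Congratulations, you won @ free rolex"

as_string = clean(text)    # iterates over the characters of the string
as_list = clean([text])    # iterates over the single message in the list

print(len(as_string), len(as_list))
```

The first count equals the length of the string, which matches the number of 'ham' entries in the array above; the second is one.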

It should be:

EmailClassification.predict(["Congratulations, you won @ free rolex"])
array(['spam'], dtype='<U4')
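If you want the pipeline to tolerate a single message passed as a plain string, one option is to wrap it inside the first step. This is only a sketch under assumptions: SafeTextCleaning is a hypothetical name, and the spaCy lemmatization from the question is left out so the example stays self-contained.

```python
import re

class SafeTextCleaning:
    """Hypothetical variant of TextCleaning that accepts a bare string."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # A lone str would otherwise be iterated character by character,
        # so wrap it into a one-element list first.
        if isinstance(X, str):
            X = [X]
        documents = []
        for sent in X:
            # Remove single letters surrounded by whitespace
            sent = re.sub(r'\s+[a-zA-Z]\s+', ' ', sent)
            # Collapse runs of whitespace into one space
            sent = re.sub(r'\s+', ' ', sent)
            documents.append(sent)  # lemmatization with spaCy omitted here
        return documents

cleaned = SafeTextCleaning().transform("Congratulations, you won @ free rolex")
print(len(cleaned))
```

Whether to wrap the string silently or raise an error instead is a design choice; raising makes the mistake visible to the caller, while wrapping is more forgiving.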

Upvotes: 1
