Using Word2Vec in scikit-learn pipeline

Question

I am trying to run Word2Vec (w2v) on this sample of data:

Statement              Label
Says the Annies List political group supports third-trimester abortions on demand.       FALSE
When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration.         TRUE
"Hillary Clinton agrees with John McCain ""by voting to give George Bush the benefit of the doubt on Iran."""     TRUE
Health care reform legislation is likely to mandate free sex change surgeries.    FALSE
The economic turnaround started at the end of my term.     TRUE
The Chicago Bears have had more starting quarterbacks in the last 10 years than the total number of tenured (UW) faculty fired during the last two decades.    TRUE
Jim Dunnam has not lived in the district he represents for years now.    FALSE

using the code provided in this GitHub folder (FeatureSelection.py):

https://github.com/nishitpatel01/Fake_News_Detection

I would like to include word2vec features in my Naive Bayes model. First I considered X and y and used train_test_split:

X = df['Statement']
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

dataset = pd.concat([X_train, y_train], axis=1)

This is the code I am currently using:

#Using Word2Vec 
with open("glove.6B.50d.txt", "rb") as lines:
    w2v = {line.split()[0]: np.array(map(float, line.split()[1:]))
           for line in lines}

training_sentences = DataPrep.train_news['Statement']

model = gensim.models.Word2Vec(training_sentences, size=100) # x be tokenized text
w2v = dict(zip(model.wv.index2word, model.wv.syn0))


class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(word2vec.itervalues().next())

    def fit(self, X, y): # what are X and y?
        return self

    def transform(self, X): # should it be training_sentences?
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])


"""
class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        self.dim = len(word2vec.itervalues().next())
    def fit(self, X, y):
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)
        # if a word was never seen - it must be at least as infrequent
        # as any of the known words - so the default idf is the max of 
        # known idf's
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])
        return self
    def transform(self, X):
        return np.array([
                np.mean([self.word2vec[w] * self.word2weight[w]
                         for w in words if w in self.word2vec] or
                        [np.zeros(self.dim)], axis=0)
                for words in X
            ])
"""

and in classifier.py, I am running

nb_pipeline = Pipeline([
        ('NBCV',FeaturesSelection.w2v),
        ('nb_clf',MultinomialNB())])

However this is not working and I am getting this error:

TypeError                                 Traceback (most recent call last)
 in 
      2 nb_pipeline = Pipeline([
      3         ('NBCV',FeaturesSelection.w2v),
----> 4         ('nb_clf',MultinomialNB())])

/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     71                           FutureWarning)
     72         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73         return f(**kwargs)
     74     return inner_f
     75 

/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in __init__(self, steps, memory, verbose)
    112         self.memory = memory
    113         self.verbose = verbose
--> 114         self._validate_steps()
    115 
    116     def get_params(self, deep=True):

/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _validate_steps(self)
    160                                 "transformers and implement fit and transform "
    161                                 "or be the string 'passthrough' "
--> 162                                 "'%s' (type %s) doesn't" % (t, type(t)))
    163 
    164         # We allow last estimator to be None as an identity transformation

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '{' ': array([-0.17019527,  0.32363772, -0.0770281 , -0.0278154 , -0.05182227, ....

I am using all the programs in that folder, so the error should be reproducible if you use them.

If you could explain how to fix it and what other changes to the code would be necessary, that would be great. My goal is to compare models (Naive Bayes, random forest, ...) using BoW, TF-IDF and Word2Vec features.

Update:

After the answer below (from Ismail), I updated the code as follows:

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec, size=100):
        self.word2vec = word2vec
        self.dim = size

and

#building Linear SVM classifier
svm_pipeline = Pipeline([
        ('svmCV',FeaturesSelection_W2V.MeanEmbeddingVectorizer(FeaturesSelection_W2V.w2v)),
        ('svm_clf',svm.LinearSVC())
        ])

svm_pipeline.fit(DataPrep.train_news['Statement'], DataPrep.train_news['Label'])
predicted_svm = svm_pipeline.predict(DataPrep.test_news['Statement'])
np.mean(predicted_svm == DataPrep.test_news['Label'])

However, I am still getting errors.

Ismail Durmaz · Accepted Answer

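Step 1. FeaturesSelection.w2v is a dict, so it has no fit or transform methods and cannot be used as a pipeline step; wrap it in MeanEmbeddingVectorizer instead. Also, MultinomialNB needs non-negative feature values, and averaged word vectors contain negative entries, so it does not work on them directly. I therefore added a pre-processing stage that rescales the features into a non-negative range.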

from sklearn.preprocessing import MinMaxScaler

nb_pipeline = Pipeline([
        ('NBCV',MeanEmbeddingVectorizer(FeatureSelection.w2v)),
        ('nb_norm', MinMaxScaler()),
        ('nb_clf',MultinomialNB())
    ])

... instead of

nb_pipeline = Pipeline([
        ('NBCV',FeatureSelection.w2v),
        ('nb_clf',MultinomialNB())
    ])
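A minimal sketch of why the scaler is needed, using made-up 3-dimensional features in place of real averaged embeddings (the array values and labels are illustrative assumptions, not data from the question). MultinomialNB rejects negative input at fit time, while the same data is accepted once MinMaxScaler has mapped each column into [0, 1]:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for averaged word vectors: embeddings usually contain negatives.
X = np.array([
    [-0.17, 0.32, -0.08],
    [0.25, -0.41, 0.10],
    [-0.05, 0.12, 0.33],
    [0.40, -0.22, -0.15],
])
y = np.array(["FALSE", "TRUE", "TRUE", "FALSE"])

# Fitting MultinomialNB directly fails because of the negative entries.
try:
    MultinomialNB().fit(X, y)
    raw_fit_failed = False
except ValueError:
    raw_fit_failed = True

# MinMaxScaler rescales every column into [0, 1], so the classifier accepts it.
scaled_nb = Pipeline([
    ("nb_norm", MinMaxScaler()),
    ("nb_clf", MultinomialNB()),
])
scaled_nb.fit(X, y)
predictions = scaled_nb.predict(X)
```

The scaling is a workaround rather than a natural fit: MultinomialNB is designed for count-like features, so a classifier that accepts real-valued input may be a fairer baseline when comparing against embedding features.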

Step 2. I got an error on word2vec.itervalues().next(), because dict.itervalues() exists only in Python 2. So I decided to pass the dimension explicitly, using the same value as the size argument given to Word2Vec.

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec, size=100):
        self.word2vec = word2vec
        self.dim = size

... instead of

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = len(word2vec.itervalues().next())
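Both steps can be combined into one transformer that runs on Python 3. The sketch below is an assumption-laden illustration, not the repository's code: toy_w2v is a made-up 4-dimensional dict standing in for FeatureSelection.w2v, size defaults to None so the dimension can be inferred with next(iter(...)), and the lowercase whitespace tokenizer is hypothetical. The tokenizing step matters because the ' ' key visible in the question's error message suggests the raw strings were being iterated character by character.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


class MeanEmbeddingVectorizer(BaseEstimator, TransformerMixin):
    """Average the word vectors of each document (zeros if no word is known)."""

    def __init__(self, word2vec, size=None):
        self.word2vec = word2vec
        self.size = size

    def fit(self, X, y=None):
        # Python 3 replacement for word2vec.itervalues().next()
        self.dim_ = self.size or len(next(iter(self.word2vec.values())))
        return self

    def transform(self, X):
        rows = []
        for text in X:
            # Hypothetical tokenizer: lowercase, then split on whitespace.
            tokens = text.lower().split()
            vectors = [self.word2vec[w] for w in tokens if w in self.word2vec]
            rows.append(np.mean(vectors, axis=0) if vectors else np.zeros(self.dim_))
        return np.array(rows)


# Toy embeddings standing in for FeatureSelection.w2v (made-up values).
toy_w2v = {
    "economy": np.array([0.9, 0.1, -0.2, 0.3]),
    "jobs": np.array([0.8, 0.2, -0.1, 0.4]),
    "aliens": np.array([-0.7, 0.9, 0.5, -0.6]),
    "hoax": np.array([-0.8, 0.8, 0.6, -0.5]),
}
train_texts = ["Economy and jobs", "jobs jobs economy", "aliens hoax", "Hoax about aliens"]
train_labels = ["TRUE", "TRUE", "FALSE", "FALSE"]

svm_pipeline = Pipeline([
    ("svmCV", MeanEmbeddingVectorizer(toy_w2v)),
    ("svm_clf", LinearSVC()),
])
svm_pipeline.fit(train_texts, train_labels)

# Inspect the features the first step produces for two new documents.
features = svm_pipeline.named_steps["svmCV"].transform(["economy", "unknown words only"])
```

A pandas Series of statements can be passed in place of the lists, since transform only iterates over the documents. If the embeddings come from a Word2Vec model, that model should be trained on the same kind of token lists that transform produces, otherwise the dictionary keys and the lookup tokens will not match.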
