lup3x

Reputation: 249

Combining heterogeneous features in scikit-learn

I'm doing binary classification over some documents whose features are already extracted and given in a text file. My problem is that there are both textual features and numerical features, such as years and a few others. Each sample is given in this format:

label |title text |otherText text |numFeature1 number |numFeature2 number

I'm following the documentation about feature unions, but their use case is a bit different. I do not need to extract the numerical features from another feature, because they are already given.

Currently my setup looks like this:

pipeline = Pipeline([
    ('features', Features()),

    ('union', FeatureUnion(
        transformer_list=[
            ('title', Pipeline([
                ('selector', ItemSelector(key='title')),
                ('tfidf', TfidfVectorizer()),
            ])),
            ('otherText', Pipeline([
                ('selector', ItemSelector(key='otherText')),
                ('tfidf', TfidfVectorizer()),
            ])),
            ('numFeature1', Pipeline([
                ('selector', ItemSelector(key='numFeature1')),
            ])),
            ('numFeature2', Pipeline([
                ('selector', ItemSelector(key='numFeature2')),
            ])),
        ],
    )),
    ('classifier', MultinomialNB()),
])

The Features class is also adapted from the documentation:

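import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Features(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        # One record per post; every field is stored as a Python object.
        features = np.recarray(shape=(len(posts),),
                               dtype=[('title', object), ('otherText', object),
                                      ('numFeature1', object), ('numFeature2', object)])

        for i, text in enumerate(posts):
            # Split on the "|name" markers; fields[0] is the label.
            fields = re.split(r"\|\w+", text)
            features['title'][i] = fields[1]
            features['otherText'][i] = fields[2]
            features['numFeature1'][i] = fields[3]
            features['numFeature2'][i] = fields[4]

        return features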

My problem now is: how do I add the numerical features to the FeatureUnion? When I use a CountVectorizer on them I get "ValueError: empty vocabulary; perhaps the documents only contain stop words", and using a DictVectorizer with only one entry doesn't strike me as the way to go.

Upvotes: 2

Views: 1877

Answers (2)

Vivek Kumar

Reputation: 36609

ItemSelector() picks data from the given dict (X) according to the key supplied in the constructor and returns a one-dimensional array of shape [n,].

FeatureUnion does not handle this kind of [n,] array correctly. It requires a two-dimensional array from each of its internal transformers, with a consistent first dimension (the number of samples), so that the outputs can be stacked horizontally to combine the features.

The second step in your first two transformers (TfidfVectorizer()) takes this [n,] array from ItemSelector and outputs a valid [n,m] array, where m is the number of features extracted from the raw text.

But your 3rd and 4th transformers contain only ItemSelector(), so each of them outputs an [n,] array. That is the cause of the errors.
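The shape mismatch can be reproduced with plain NumPy. This is only a sketch with made-up values: the array names are hypothetical, and np.hstack stands in for the horizontal stacking that FeatureUnion performs on its transformers' outputs.

```python
import numpy as np

n_samples = 4

# Stand-in for a vectorizer's output: a 2-D block of shape (n, m).
text_block = np.ones((n_samples, 3))

# Stand-in for what a bare selector returns: a 1-D array of shape (n,).
years_1d = np.array([1999.0, 2005.0, 2010.0, 2015.0])

# The same values as a single-column 2-D array of shape (n, 1).
years_2d = years_1d.reshape(-1, 1)

# Stacking a 2-D block with a 1-D array fails with a ValueError.
try:
    np.hstack([text_block, years_1d])
    stacking_error = None
except ValueError as exc:
    stacking_error = exc

# Stacking two 2-D blocks with the same number of rows works.
stacked = np.hstack([text_block, years_2d])

print(years_1d.shape, years_2d.shape, stacked.shape)  # (4,) (4, 1) (4, 4)
print("1-D stacking failed:", stacking_error is not None)
```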

To correct this, reshape the output of ItemSelector to [n,1]. Change the following code in ItemSelector.transform() (I'm assuming you are using the ItemSelector code from the link you specified):

Original

data_dict[self.key]

New

data_dict[self.key].reshape((-1,1))

The reshape() call turns your [n,] array into an [n,1] array, which FeatureUnion can then append to the other features correctly.
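Put together, the idea might look like the sketch below. It makes several assumptions that are not in the question: ColumnSelector is a hypothetical stand-in for ItemSelector with an extra numeric flag, the data is a made-up dict of arrays, and the numeric strings are cast to float because a classifier will likely need numbers rather than text.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for ItemSelector: picks one field by key."""

    def __init__(self, key, numeric=False):
        self.key = key
        self.numeric = numeric

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        column = X[self.key]
        if self.numeric:
            # Cast to float and reshape (n,) -> (n, 1) so the union can stack it.
            return np.asarray(column).astype(float).reshape(-1, 1)
        # Text columns stay 1-D; the vectorizer turns them into (n, m).
        return column


# Made-up toy data, keyed by field name.
data = {
    "title": np.array(["red apple", "green pear", "red pear"], dtype=object),
    "numFeature1": np.array(["1999", "2005", "2010"], dtype=object),
}

union = FeatureUnion([
    ("title", Pipeline([
        ("selector", ColumnSelector("title")),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("numFeature1", ColumnSelector("numFeature1", numeric=True)),
])

combined = union.fit_transform(data)
print(combined.shape)  # (3, 5): four tf-idf terms plus one numeric column
```

Raw years are on a much larger scale than tf-idf weights, so scaling the numeric column before classification may be worth considering.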

Upvotes: 0

Vivek Kalyanarangan

Reputation: 9081

The TfidfVectorizer() object has not been fitted with data yet.

Before constructing the pipeline, do this:

vec = TfidfVectorizer()
vec.fit(data['free text column'])
pipeline = Pipeline([
('features', Features()),

('union', FeatureUnion(
    transformer_list=[
        ('title', Pipeline([
            ('selector', ItemSelector(key='title')),
            ('tfidf', vec),
        ])),

        ... other features

This helps if you want to run the pipeline again on test data: for test data the pipeline automatically calls transform() on the TfidfVectorizer instead of fit(), so the fit() has to be done explicitly before constructing the pipeline.
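For reference, here is a small sketch of how a plain Pipeline treats its vectorizer at training and at prediction time. The documents and labels are made up, and the pipeline is deliberately reduced to two steps: fit() learns the vocabulary from the training documents, and predict() only transforms new documents with that learned vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training and test documents.
train_docs = ["red apple", "green pear", "red pear", "green apple"]
train_labels = [0, 1, 0, 1]
test_docs = ["red banana"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])

# fit() fits the vectorizer on the training documents, then the classifier.
pipeline.fit(train_docs, train_labels)

# The vocabulary comes from the training documents only.
vocab = sorted(pipeline.named_steps["tfidf"].vocabulary_)
print(vocab)  # ['apple', 'green', 'pear', 'red']

# predict() calls transform() on the vectorizer, so "banana" is ignored.
predictions = pipeline.predict(test_docs)
print(predictions.shape)  # (1,)
```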

Upvotes: 0
