FeatureUnion with different feature dimensions

Question

I want to classify some sentences with sklearn. The sentences are stored in a Pandas DataFrame.

To begin, I want to use the length of each sentence and its TF-IDF vector as features, so I created this pipeline:

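from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('meta', Pipeline([
            ('length', LengthAnalyzer())
        ])),
        ('bag-of-words', Pipeline([
            ('tfidf', TfidfVectorizer())
        ]))
    ])),
    ('model', LogisticRegression())
])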

where LengthAnalyzer is a custom TransformerMixin with:

    def transform(self, documents):
        for document in documents:
            yield len(document)

So LengthAnalyzer returns a single number per sentence (1 dimension), while TfidfVectorizer returns an n-dimensional vector.

When I try to run this, I get

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 494, expected 1.

What has to be done to make this feature combination work?

Vivek Kumar · Accepted Answer

The problem seems to come from the yield used in transform(). Because of the yield, transform() returns a generator rather than an array, so the number of rows reported to the scipy hstack method is probably 1 instead of the actual number of samples in documents.

Your data should have 494 rows (samples). TfidfVectorizer reports that correctly, but LengthAnalyzer reports only a single row, hence the error.
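To illustrate the shape mismatch, here is a small sketch comparing what NumPy sees when handed a generator versus a proper column array. The toy documents and the lengths_generator helper are made up for this example, and this only shows the NumPy side; I have not traced exactly how FeatureUnion and scipy handle the generator internally.

```python
import numpy as np

# Toy documents, invented for illustration.
documents = ["short one", "a somewhat longer sentence", "tiny"]


def lengths_generator(documents):
    # Mirrors the transform() from the question: yields one length at a time.
    for document in documents:
        yield len(document)


# NumPy cannot iterate a generator into rows; it wraps it as one opaque object.
as_generator = np.asarray(lengths_generator(documents))

# A list comprehension reshaped to a column gives one row per document.
as_column = np.array([len(document) for document in documents]).reshape(-1, 1)

print(as_generator.shape)  # ()
print(as_column.shape)     # (3, 1)
```

The generator ends up as a zero-dimensional object array with no per-sample rows, while the reshaped array has shape (n_samples, 1), which is the layout a horizontal stack with the TF-IDF matrix needs.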

If you change transform() to return a 2-D array with one row per document (with numpy imported as np):

return np.array([len(document) for document in documents]).reshape(-1,1)

then the pipeline fits successfully.
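For completeness, a minimal end-to-end sketch of the pipeline with that change applied. The fit() method, the BaseEstimator base class, and the toy sentences and labels are my own assumptions, since the question only shows transform(); adapt them to your actual LengthAnalyzer and DataFrame.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


class LengthAnalyzer(BaseEstimator, TransformerMixin):
    """Emit a single column holding the character length of each document."""

    def fit(self, documents, y=None):
        # Nothing to learn (assumed); present so the step works in a Pipeline.
        return self

    def transform(self, documents):
        # Return shape (n_samples, 1) instead of a generator.
        return np.array([len(document) for document in documents]).reshape(-1, 1)


pipeline = Pipeline([
    ('features', FeatureUnion([
        ('meta', Pipeline([('length', LengthAnalyzer())])),
        ('bag-of-words', Pipeline([('tfidf', TfidfVectorizer())])),
    ])),
    ('model', LogisticRegression()),
])

# Toy data, invented for illustration only.
sentences = [
    "short one",
    "a somewhat longer sentence here",
    "tiny",
    "another fairly long example sentence",
]
labels = [0, 1, 0, 1]

pipeline.fit(sentences, labels)

# Inspect the combined feature matrix: one length column plus the TF-IDF columns.
features = pipeline.named_steps['features'].transform(sentences)
print(features.shape)
```

The first column of the combined matrix holds the sentence lengths and the remaining columns are the TF-IDF terms. Since the raw lengths are on a very different scale from the TF-IDF values, you may want to scale that column before the model, but that is a separate question.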

Note: I looked for a related issue on the scikit-learn GitHub but could not find one. You could post this there to get authoritative feedback on the usage.
