Reputation: 3016
I want to classify some sentences with sklearn. The sentences are stored in a Pandas DataFrame.
To begin, I want to use the length of the sentence and its TF-IDF vector as features, so I created this pipeline:
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('meta', Pipeline([
            ('length', LengthAnalyzer())
        ])),
        ('bag-of-words', Pipeline([
            ('tfidf', TfidfVectorizer())
        ]))
    ])),
    ('model', LogisticRegression())
])
where LengthAnalyzer is a custom TransformerMixin whose transform is:

def transform(self, documents):
    for document in documents:
        yield len(document)
So LengthAnalyzer returns a single number per document (one dimension), while TfidfVectorizer returns an n-dimensional matrix.
When I try to run this, I get
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 494, expected 1.
What has to be done to make this feature combination work?
Upvotes: 1
Views: 847
Reputation: 36619
The problem seems to originate from the yield used in transform(): because the transformer returns a generator, the number of rows reported to scipy's hstack is 1 instead of the actual number of samples in documents.
There should be 494 rows (samples) in your data, which TfidfVectorizer reports correctly, but LengthAnalyzer reports only a single row. Hence the error.
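You can reproduce the same kind of error outside the pipeline. This is only a minimal sketch (the block shapes are assumptions chosen to mirror your 494 samples): FeatureUnion stacks the per-transformer outputs horizontally, and scipy's hstack refuses blocks whose row counts disagree.

import numpy as np
from scipy import sparse

# TF-IDF block: one row per document (494 samples, arbitrary vocabulary size)
tfidf_block = sparse.random(494, 1000, format="csr")

# Length block collapsed to a single row instead of shape (494, 1)
length_block = np.array([[42]])

# Raises: ValueError: blocks[0,:] has incompatible row dimensions. ...
combined = sparse.hstack([length_block, tfidf_block])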
If you can change it to
    return np.array([len(document) for document in documents]).reshape(-1, 1)
then the pipeline fits successfully.
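For completeness, here is a minimal sketch of what the fixed transformer could look like (the class layout and fit signature are assumptions, since only transform was shown in the question):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LengthAnalyzer(BaseEstimator, TransformerMixin):
    """Returns the length of each document as an (n_samples, 1) column."""

    def fit(self, documents, y=None):
        # Nothing to learn; lengths are computed directly in transform()
        return self

    def transform(self, documents):
        # One row per document, so FeatureUnion can hstack this with the TF-IDF output
        return np.array([len(document) for document in documents]).reshape(-1, 1)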
Note: I tried to find a related issue on the scikit-learn GitHub but was unsuccessful. You could report this there to get some real feedback on this usage.
Upvotes: 3