Reputation: 992
I get an error when I use CountVectorizer inside a scikit-learn make_column_transformer().
My DataFrame has two columns, 'desc-title' and 'SPchangeHigh'.
Here's a snippet of two rows:
import pandas as pd
features = pd.DataFrame([["T. Rowe Price sells most of its Tesla shares", 0.002152],
                         ["Gannett to retain all seats in MNG proxy fight", 0.002152]],
                        columns=["desc-title", "SPchangeHigh"])
I am able to run the following pipeline with no issue:
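from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
preprocess = make_column_transformer(
    (StandardScaler(), ['SPchangeHigh']),
    (OneHotEncoder(), ['desc-title'])
)
preprocess.fit_transform(features.head(2))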
However, when I replace OneHotEncoder() with CountVectorizer(tokenizer=tokenize), it fails:
preprocess = make_column_transformer(
    (StandardScaler(), ['SPchangeHigh']),
    (CountVectorizer(tokenizer=tokenize), ['desc-title'])
)
preprocess.fit_transform(features.head(2))
and the error that I get is this:
ValueError Traceback (most recent call last)
<ipython-input-71-d77f136b9586> in <module>()
3 ( CountVectorizer(tokenizer=tokenize),['desc-title'])
4 )
----> 5 preprocess.fit_transform(features.head(2))
C:\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in fit_transform(self, X, y)
488 self._validate_output(Xs)
489
--> 490 return self._hstack(list(Xs))
491
492 def transform(self, X):
C:\anaconda3\lib\site-packages\sklearn\compose\_column_transformer.py in _hstack(self, Xs)
545 else:
546 Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]
--> 547 return np.hstack(Xs)
548
549
C:\anaconda3\lib\site-packages\numpy\core\shape_base.py in hstack(tup)
338 return _nx.concatenate(arrs, 0)
339 else:
--> 340 return _nx.concatenate(arrs, 1)
341
342
ValueError: all the input array dimensions except for the concatenation axis must match exactly
I would appreciate any help.
Upvotes: 3
Views: 1036
Reputation: 1142
Remove the brackets around 'desc-title'. CountVectorizer expects a one-dimensional sequence of strings, not a two-dimensional column vector.
preprocess = make_column_transformer(
    (StandardScaler(), ['SPchangeHigh']),
    (CountVectorizer(), 'desc-title')
)
preprocess.fit_transform(features.head(2))
The scikit-learn documentation describes this distinction:
The difference between specifying the column selector as 'column' (as a simple string) and ['column'] (as a list with one element) is the shape of the array that is passed to the transformer. In the first case, a one dimensional array will be passed, while in the second case it will be a 2-dimensional array with one column, i.e. a column vector
...
Be aware that some transformers expect a 1-dimensional input (the label-oriented ones) while some others, like OneHotEncoder or Imputer, expect 2-dimensional input, with the shape [n_samples, n_features].
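To see the shape difference directly, here is a minimal sketch using the two rows from the question. It assumes the default CountVectorizer tokenizer rather than the custom tokenize function from the question, and a scikit-learn version where make_column_transformer takes (transformer, columns) tuples.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

features = pd.DataFrame(
    [["T. Rowe Price sells most of its Tesla shares", 0.002152],
     ["Gannett to retain all seats in MNG proxy fight", 0.002152]],
    columns=["desc-title", "SPchangeHigh"],
)

# A plain string selects a 1-D Series: one document per row.
one_d = CountVectorizer().fit_transform(features["desc-title"])
print(one_d.shape[0])  # row count of the vectorized Series

# A list selects a 2-D DataFrame. Iterating over a DataFrame yields its
# column labels, so the vectorizer iterates over labels, not over rows.
two_d = CountVectorizer().fit_transform(features[["desc-title"]])
print(two_d.shape[0])  # row count of the vectorized DataFrame

# With the string selector, both transformers return the same number of rows,
# so the column transformer can stack their outputs side by side.
preprocess = make_column_transformer(
    (StandardScaler(), ["SPchangeHigh"]),
    (CountVectorizer(), "desc-title"),
)
result = preprocess.fit_transform(features)
print(result.shape)
```

As far as I can tell, this is why the bracketed version fails: the vectorizer's output no longer has one row per sample, so it cannot be stacked next to the StandardScaler output, and np.hstack raises the ValueError shown in the question.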
Upvotes: 8