Reputation: 415
I'm working on some customer_data where, as a first step, I want to do PCA, followed by clustering as a second step.
Since the data needs to be encoded (and scaled) before it can be fed to PCA, I thought it would be good to fit it all into a pipeline, which unfortunately doesn't seem to work.
How can I create this pipeline, and does it even make sense to do it like this?
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Creating pipeline objects
encoder = OneHotEncoder(drop='first')
scaler = StandardScaler(with_mean=False)
pca = PCA()

# Create pipeline
pca_pipe = make_pipeline(encoder, scaler, pca)

# Fit data to pipeline
pca_pipe.fit_transform(customer_data_raw)
I get the following error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-27-c4ce88042a66> in <module>()
20
21 # Fit data to pipeline
---> 22 pca_pipe.fit_transform(customer_data_raw)
2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/decomposition/_pca.py in _fit(self, X)
385 # This is more informative than the generic one raised by check_array.
386 if issparse(X):
--> 387 raise TypeError('PCA does not support sparse input. See '
388 'TruncatedSVD for a possible alternative.')
389
TypeError: PCA does not support sparse input. See TruncatedSVD for a possible alternative.
Upvotes: 1
Views: 1825
Reputation: 12678
OneHotEncoder creates a sparse matrix on transform by default. From there the error message is pretty straightforward: you can try TruncatedSVD instead of PCA. However, you could also set sparse=False in the encoder if you want to stick to PCA.
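A minimal sketch of that second option, keeping your pipeline but forcing a dense output from the encoder (this assumes an older scikit-learn where the parameter is called sparse; in recent releases it was renamed to sparse_output):

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Dense output so PCA accepts it (use sparse_output=False on scikit-learn >= 1.2)
encoder = OneHotEncoder(drop='first', sparse=False)
scaler = StandardScaler()  # with_mean=False is only needed for sparse input
pca = PCA()

pca_pipe = make_pipeline(encoder, scaler, pca)
# components = pca_pipe.fit_transform(customer_data_raw)

# Alternative: keep the sparse encoder output and swap PCA for
# TruncatedSVD(n_components=...), which works directly on sparse matrices.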
That said, do you really want to one-hot encode every feature? And then scale those dummy variables? Consider using a ColumnTransformer if you'd like to encode some features and scale others.
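For instance, a rough ColumnTransformer setup along those lines; the column names here are hypothetical, just to illustrate splitting categorical from numeric features:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Hypothetical column split; replace with the real columns in customer_data_raw
categorical_cols = ['country', 'segment']
numeric_cols = ['age', 'income']

preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(drop='first', sparse=False), categorical_cols),
    ('scale', StandardScaler(), numeric_cols),
])

pca_pipe = make_pipeline(preprocess, PCA())
# components = pca_pipe.fit_transform(customer_data_raw)

This way only the categorical columns get one-hot encoded and only the numeric ones get scaled, and PCA receives a dense matrix.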
Upvotes: 4