Reputation: 75
I'm trying to optimize an unsupervised kernel PCA algorithm. Here is some context:
Another approach, this time entirely unsupervised, is to select the kernel and hyperparameters that yield the lowest reconstruction error. However, reconstruction is not as easy as with linear PCA.
....
Fortunately, it is possible to find a point in the original space that would map close to the reconstructed point. This is called the reconstruction pre-image. Once you have this pre-image, you can measure its squared distance to the original instance. You can then select the kernel and hyperparameters that minimize this reconstruction pre-image error.
One solution is to train a supervised regression model, with the projected instances as the training set and the original instances as the targets.
Now you can use grid search with cross-validation to find the kernel and hyperparameters that minimize this pre-image reconstruction error.
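For illustration, here is a minimal sketch of that regression idea (hypothetical code, using make_swiss_roll as a stand-in for X since the original data isn't shown; note that scikit-learn's fit_inverse_transform=True, used in the book code below, does essentially this internally via kernel ridge regression):
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error

# Stand-in for the original data
X, _ = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

# Project with kernel PCA
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433)
X_reduced = rbf_pca.fit_transform(X)

# Regression model: projected instances -> original instances
inverse_model = KernelRidge(kernel="rbf", gamma=0.0433)
inverse_model.fit(X_reduced, X)
X_preimage = inverse_model.predict(X_reduced)

print(mean_squared_error(X, X_preimage))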
The code provided in the book to perform the reconstruction, without cross-validation, is:
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)

>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(X, X_preimage)
32.786308795766132
My question is: how do I go about implementing cross-validation to tune the kernel and hyperparameters to minimize this pre-image reconstruction error?
Here is my go at it so far:
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

kpca = KernelPCA(fit_inverse_transform=True, n_jobs=-1)

param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid", "linear", "poly"]
}]

grid_search = GridSearchCV(clf, param_grid, cv=3, scoring='mean_squared_error')

X_reduced = kpca.fit_transform(X)
X_preimage = kpca.inverse_transform(X_reduced)
grid_search.fit(X, X_preimage)
Thank you
Upvotes: 3
Views: 6510
Reputation: 580
I have created a class that does a hyperparameter search on the KernelPCA class.
Look: https://gist.github.com/Kemsekov/1a8a95b93b2388f9d86d9fc33c7f9577
It automatically searches over kernels and their parameters. Use it like this:
from kernel_pca_search import KernelPCASearchCV

# create the search object
kpca_cv = KernelPCASearchCV(n_components=3)
# get low-dimensional features
x_transform = kpca_cv.fit_transform(X)
print("r2 score of dimensionality reduction performance", kpca_cv.score)
Upvotes: 0
Reputation: 36619
GridSearchCV is capable of doing cross-validation of unsupervised learning (without a y), as can be seen in the documentation:

fit(X, y=None, groups=None, **fit_params)
... y : array-like, shape = [n_samples] or [n_samples, n_output], optional. Target relative to X for classification or regression; None for unsupervised learning ...

So the only thing that needs to be handled is how the scoring will be done.
The following will happen inside GridSearchCV:

1. The data X will be divided into train-test splits based on the folds defined in the cv param.
2. For each combination of the parameters you specified in param_grid, the model will be trained on the train part from step 1, and then scoring will be applied to the test part.
3. The scores for each parameter combination will be combined across all the folds and averaged; the highest-scoring parameter combination will be selected.
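For illustration, here is a minimal sketch of that loop for a single parameter combination (assuming X is a NumPy array; the gamma value is just an example, not a recommendation):
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

fold_errors = []
for train_idx, test_idx in KFold(n_splits=3).split(X):
    # Steps 1-2: fit on the train part of the fold
    model = KernelPCA(n_components=2, kernel="rbf", gamma=0.04,
                      fit_inverse_transform=True)
    model.fit(X[train_idx])
    # Score on the test part: project, reconstruct, measure the error
    X_test_preimage = model.inverse_transform(model.transform(X[test_idx]))
    fold_errors.append(mean_squared_error(X[test_idx], X_test_preimage))

# Step 3: average across folds
print(np.mean(fold_errors))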
Now the tricky part is step 2. By default, if you provide a string in scoring, it will be converted to a make_scorer object internally. For 'mean_squared_error' the relevant code is here:
....
neg_mean_squared_error_scorer = make_scorer(mean_squared_error,
                                            greater_is_better=False)
....
which is what you don't want, because that requires y_true and y_pred.
The other option is to make your own custom scorer, as discussed here, with the signature (estimator, X, y). Something like the following for your case:
from sklearn.metrics import mean_squared_error

def my_scorer(estimator, X, y=None):
    # Project the data, then map it back to the original space
    X_reduced = estimator.transform(X)
    X_preimage = estimator.inverse_transform(X_reduced)
    # GridSearchCV maximizes the score, so negate the error
    return -1 * mean_squared_error(X, X_preimage)
Then use it in GridSearchCV like this:
param_grid = [{
    "gamma": np.linspace(0.03, 0.05, 10),
    "kernel": ["rbf", "sigmoid", "linear", "poly"]
}]

kpca = KernelPCA(fit_inverse_transform=True, n_jobs=-1)
grid_search = GridSearchCV(kpca, param_grid, cv=3, scoring=my_scorer)
grid_search.fit(X)
Upvotes: 5