Reputation: 3043
I am using PCA (Principal Component Analysis) from the sklearn library. The training sets that I am working with have the following shapes: X_train: (124, 13), y_train: (124,). The test sets have the following shapes: X_test: (54, 13), y_test: (54,).
This is how I am doing the PCA:
from sklearn.decomposition import PCA
pca = PCA(0.75) #75 % variance retained
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
print X_train_pca.shape, X_test_pca.shape, y_train.shape, y_test.shape
>>> (124, 5), (54, 5), (124,), (54,)
To test the goodness of the results obtained from Principal Component Analysis, I use logistic regression first.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr = lr.fit(X_train_pca, y_train)
Then I use score from LogisticRegression to find the efficacy of the transformation and the mean accuracy of the fit on the test data set:
print lr.score(X_test_pca, y_test)
>>> 0.9814814814814815
However, when I use score from PCA (sklearn), I encounter an error:
print pca.score(X_test_pca, y=None)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-217-540210963ed0> in <module>()
----> 3 print pca.score(X_test_pca, y=None)
/Users/username/.local/lib/python2.7/site-packages/sklearn/decomposition/pca.pyc in score(self, X, y)
529 Average log-likelihood of the samples under the current model
530 """
--> 531 return np.mean(self.score_samples(X))
532
533
/Users/username/.local/lib/python2.7/site-packages/sklearn/decomposition/pca.pyc in score_samples(self, X)
503
504 X = check_array(X)
--> 505 Xr = X - self.mean_
506 n_features = X.shape[1]
507 log_like = np.zeros(X.shape[0])
ValueError: operands could not be broadcast together with shapes (54,5) (13,)
What am I doing wrong? How can I test the goodness of the PCA results on X_test (and y_test)?
Upvotes: 2
Views: 4022
Reputation: 36599
For PCA.score(), you need to use the original test data. Currently you are passing X_test_pca into it, which has already been transformed by the PCA and has 5 columns instead of the 13 the model was fitted on.
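The shape mismatch in the traceback can be reproduced with NumPy alone. This is only a sketch: the arrays below are random stand-ins with the shapes from the question, not the asker's data, and `mean_` merely imitates `pca.mean_` (one entry per original feature).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the already-transformed test set: 54 samples, 5 components.
X_test_pca = rng.normal(size=(54, 5))
# Stand-in for pca.mean_: one mean per ORIGINAL feature (13 of them).
mean_ = rng.normal(size=13)

try:
    # Same operation as `Xr = X - self.mean_` in the traceback.
    X_test_pca - mean_
    broadcast_failed = False
except ValueError as exc:
    broadcast_failed = True
    print(exc)
```

Subtracting a length-13 vector from a 5-column array cannot be broadcast, which is why the input must have the same 13 columns the PCA was fitted on.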
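For the score() function of any scikit-learn estimator, you need to pass the same kind of data that you used in its fit() function, not the transformed output. PCA.score() centers the original data itself (the X - self.mean_ step in your traceback) and then calculates the average log-likelihood of the samples under the fitted model. Note that this value is a log-likelihood, not an accuracy, so it is not comparable to the number returned by lr.score().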
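As the traceback in the question shows, score() returns the mean of score_samples(). A minimal sketch of that relationship follows, assuming random stand-in data with the question's shapes and a fixed n_components=5 chosen only to mirror the 5 components reported there:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Random stand-ins for the standardized data (assumption: not the real data).
X_train_std = rng.normal(size=(124, 13))
X_test_std = rng.normal(size=(54, 13))

# n_components=5 is an illustrative choice, not the asker's PCA(0.75) setting.
pca = PCA(n_components=5).fit(X_train_std)

# One log-likelihood per test sample, computed in the original 13-feature space.
per_sample = pca.score_samples(X_test_std)
# score() is the average of those per-sample values.
average = pca.score(X_test_std)

print(per_sample.shape, np.isclose(average, per_sample.mean()))
```

The per-sample values can also be inspected directly, for example to spot test points that the fitted model considers unlikely.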
Change this:
pca.score(X_test_pca, y=None)
to this:
pca.score(X_test_std, y=None)
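Putting the fix in context, here is a hedged end-to-end sketch of which array goes to which score() call. The data is synthetic, the labels are made up for illustration, and the use of StandardScaler to produce the `_std` arrays is an assumption, since the question does not show how they were built.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data with the shapes from the question.
X_train = rng.normal(size=(124, 13))
X_test = rng.normal(size=(54, 13))
# Made-up binary labels derived from the first feature.
y_train = (X_train[:, 0] > 0).astype(int)
y_test = (X_test[:, 0] > 0).astype(int)

# Assumed preprocessing: standardize using statistics from the training set only.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

pca = PCA(0.75)  # keep enough components for 75% of the variance
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

lr = LogisticRegression().fit(X_train_pca, y_train)

# The classifier was fitted on reduced features, so it is scored on reduced features.
accuracy = lr.score(X_test_pca, y_test)
# The PCA was fitted on standardized 13-feature data, so it is scored on that.
log_likelihood = pca.score(X_test_std)

print(X_test_pca.shape, accuracy, log_likelihood)
```

The number of retained components and both scores will differ from the question's output, because the data here is random.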
Upvotes: 3