Siddharth Satpathy

Reputation: 3043

Error when using scikit-learn PCA.score()

I am using PCA (Principal Component Analysis) from the scikit-learn library. The training sets that I am working with have the following shapes: X_train: (124, 13), y_train: (124,). The test sets have the following shapes: X_test: (54, 13), y_test: (54,).

This is how I am doing the PCA:

from sklearn.decomposition import PCA

pca = PCA(0.75)  # retain 75% of the variance
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

print  X_train_pca.shape, X_test_pca.shape, y_train.shape, y_test.shape

>>> (124, 5), (54, 5), (124,), (54,)

To test the goodness of the results obtained from Principal Component Analysis, I use logistic regression first.

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr = lr.fit(X_train_pca, y_train)

Then I use score from LogisticRegression to gauge the efficacy of the transformation, measured as the mean accuracy of the fit on the test data set:

print lr.score(X_test_pca, y_test)
>>> 0.9814814814814815

However, when I use score from PCA (sklearn), I encounter errors:

print pca.score(X_test_pca, y=None)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-217-540210963ed0> in <module>()
----> 3 print pca.score(X_test_pca, y=None)

/Users/username/.local/lib/python2.7/site-packages/sklearn/decomposition/pca.pyc in score(self, X, y)
    529             Average log-likelihood of the samples under the current model
    530         """
--> 531         return np.mean(self.score_samples(X))
    532 
    533 

/Users/username/.local/lib/python2.7/site-packages/sklearn/decomposition/pca.pyc in score_samples(self, X)
    503 
    504         X = check_array(X)
--> 505         Xr = X - self.mean_
    506         n_features = X.shape[1]
    507         log_like = np.zeros(X.shape[0])

ValueError: operands could not be broadcast together with shapes (54,5) (13,) 

What am I doing wrong? How can I test the goodness of results of PCA in X_test (and y_test)?

Upvotes: 2

Views: 4022

Answers (1)

Vivek Kumar

Reputation: 36599

For PCA.score(), you need to pass the original (untransformed) test data. Currently you are passing X_test_pca, which has already been transformed by the fitted PCA and therefore has 5 columns instead of the 13 the model was fitted on.

In general, the score() method of a scikit-learn estimator expects data in the same form as what you passed to fit(), not the transformed output. As the traceback shows, PCA itself subtracts its fitted mean_ from the input inside score_samples() and then computes the average log-likelihood of the samples, so the input must still have the original 13 features.

Change this:

pca.score(X_test_pca, y=None)

to this:

pca.score(X_test_std, y=None)
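As a rough, self-contained sketch of the difference between the two calls: the data below is random and only mimics the shapes from the question (an assumption, since the real data set is not shown), and the StandardScaler step is a guess at how X_train_std and X_test_std were produced.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with the shapes from the question (assumption:
# the real X_train / X_test are not available here).
rng = np.random.RandomState(0)
X_train = rng.normal(size=(124, 13))
X_test = rng.normal(size=(54, 13))

# Assumed preprocessing: standardize using statistics from the training set.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Keep enough components to explain 75% of the variance.
pca = PCA(n_components=0.75)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

# score() expects data in the original 13-feature space and returns the
# average log-likelihood of the samples under the fitted model.
test_score = pca.score(X_test_std)
print("average log-likelihood on test data:", test_score)

# Passing the already-reduced data fails, because the model's mean_
# has 13 entries while X_test_pca has fewer columns.
try:
    pca.score(X_test_pca)
except ValueError as exc:
    print("ValueError:", exc)
```

The exact wording of the ValueError depends on the scikit-learn version, but in both cases the cause is the mismatch between the number of columns passed in and the 13 features seen during fit().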

Upvotes: 3
