Clay

Reputation: 2999

scikit-learn PCA: matrix transformation produces PC estimates with flipped signs

I'm using scikit-learn to perform PCA on this dataset. The scikit-learn documentation states that

Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion.

The problem is that I don't think I'm using different estimator objects, yet the signs of some of my PCs are flipped compared to the results from SAS's PROC PRINCOMP procedure.

For the first observation in the dataset, the SAS PCs are:

PC1      PC2      PC3       PC4      PC5
2.0508   1.9600   -0.1663   0.2965   -0.0121

From scikit-learn, I get the following (which are very close in magnitude):

PC1      PC2      PC3       PC4      PC5
-2.0536  -1.9627  -0.1666   -0.297   -0.0122

Here's what I'm doing:

import pandas as pd
import numpy  as np
from sklearn.decomposition import PCA

sourcef = pd.read_csv('C:/mydata.csv')
frame = pd.DataFrame(sourcef)

# Some pandas evals, regressions, etc. that I'm not showing,
# but they don't affect the data matrix

# Make sure we are working with the proper data -- drop the response variable
cols = [col for col in frame.columns if col not in ['response']]

# Separate out the data matrix from the response variable vector 
# into numpy arrays
frame2_X = frame[cols].values
frame2_y = frame['response'].values

# Standardize the values
X_means = np.mean(frame2_X,axis=0)
X_stds  = np.std(frame2_X,axis=0)

y_mean = np.mean(frame2_y)
y_std  = np.std(frame2_y)

frame2_X_stdz = np.copy(frame2_X)
frame2_y_stdz = frame2_y.astype(np.float32, copy=True)

for (x,y), value in np.ndenumerate(frame2_X_stdz):
    frame2_X_stdz[x][y] = (value - X_means[y])/X_stds[y]

for index, value in enumerate(frame2_y_stdz):
    frame2_y_stdz[index] = (float(value) - y_mean)/y_std

# Show the first 5 elements of the standardized values, to verify
print frame2_X_stdz[:,0][:5]

# Show the first 5 lines from the standardized response vector, to verify
print frame2_y_stdz[:5]

Those check out ok:

[ 0.9508 -0.5847 -0.2797 -0.4039 -0.598 ]
[ 1.0726 -0.5009 -0.0942 -0.1187 -0.8043]
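
As an aside, the same standardization can be written in one step with NumPy broadcasting; this is just an equivalent sketch of the loops above, using the same X_means, X_stds, y_mean, and y_std:

# Equivalent vectorized standardization using broadcasting
frame2_X_stdz = (frame2_X - X_means) / X_stds
frame2_y_stdz = (frame2_y - y_mean) / y_std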

Continuing on...

# Create a PCA object
pca = PCA()
pca.fit(frame2_X_stdz)

# Create the matrix of PC estimates
pca.transform(frame2_X_stdz)

Here's the output of the last step:

Out[16]: array([[-2.0536, -1.9627, -0.1666, -0.297 , -0.0122],
       [ 1.382 , -0.382 , -0.5692, -0.0257, -0.0509],
       [ 0.4342,  0.611 ,  0.2701,  0.062 , -0.011 ],
       ..., 
       [ 0.0422,  0.7251, -0.1926,  0.0089,  0.0005],
       [ 1.4502, -0.7115, -0.0733,  0.0013, -0.0557],
       [ 0.258 ,  0.3684,  0.1873,  0.0403,  0.0042]])

I've also tried replacing the separate pca.fit() and pca.transform() calls with pca.fit_transform(), but I end up with the same results.

What am I doing wrong here that I'm getting PCs with the signs flipped?

Upvotes: 7

Views: 3747

Answers (2)

Kyle Kastner

Reputation: 1018

SVD decompositions are not guaranteed to be unique - only the singular values are guaranteed to be identical, and different implementations of svd() can produce different signs. Any of the eigenvectors can have a flipped sign, and the result will still be identical when transformed and then transformed back into the original space. Most algorithms in sklearn that use an SVD decomposition call the function sklearn.utils.extmath.svd_flip() to correct for this and enforce an identical sign convention across algorithms. For historical reasons, PCA() never got this fix (though maybe it should...)
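
A quick way to see the ambiguity for yourself (a rough NumPy sketch of the idea, not the actual svd_flip implementation):

import numpy as np

# Flipping the sign of one left/right singular vector pair leaves the
# reconstruction U * diag(S) * Vt unchanged.
rng = np.random.RandomState(0)
A = rng.randn(6, 5)

U, S, Vt = np.linalg.svd(A, full_matrices=False)

U_flip = U.copy()
Vt_flip = Vt.copy()
U_flip[:, 0] *= -1   # flip the first left singular vector...
Vt_flip[0, :] *= -1  # ...and the matching right singular vector

print(np.allclose(U.dot(np.diag(S)).dot(Vt),
                  U_flip.dot(np.diag(S)).dot(Vt_flip)))   # True

# One convention (similar in spirit to svd_flip): make the largest-magnitude
# entry of each right singular vector positive, so repeated runs agree
signs = np.sign(Vt[np.arange(Vt.shape[0]), np.abs(Vt).argmax(axis=1)])
Vt_fixed = Vt * signs[:, np.newaxis]
U_fixed = U * signs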

In general, this is not something to worry about - just a limitation of the SVD algorithm as typically implemented.

On a related note, I find assigning importance to PC weights (and parameter weights in general) dangerous, because of exactly these kinds of issues. Numerical/implementation details should not influence your analysis results, but it is often hard to tell what is a result of the data and what is a result of the algorithms you use for exploration. I know this is a homework assignment, not a choice, but it is important to keep these things in mind!

Upvotes: 3

loopbackbee

Reputation: 23342

You're doing nothing wrong.

What the documentation is warning you about is that repeated calls to fit may yield different principal components - not how they relate to another PCA implementation.

Having a flipped sign on all components doesn't make the result wrong - the result is right as long as it fulfills the definition (each component is chosen such that it captures the maximum amount of variance in the data). As it stands, it seems the projection you got is simply mirrored - it still fulfills the definition, and is, thus, correct.

If, beyond correctness, you're worried about consistency between implementations, you can simply multiply the components by -1 when necessary.
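
For example, a minimal sketch of that sign alignment (sas_scores below is a hypothetical array holding the SAS estimates for the same observations, named here only for illustration):

import numpy as np

# Flip any scikit-learn component whose first score disagrees in sign
# with the reference scores.
def align_signs(scores, reference):
    flip = np.where(np.sign(scores[0]) == np.sign(reference[0]), 1.0, -1.0)
    return scores * flip

# e.g. aligned = align_signs(pca.transform(frame2_X_stdz), sas_scores)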

Upvotes: 8
