Python and SAS produce PCA data with same abs. values but inverted signs - why?

Question

I am building a Python 3 script (pandas for data manipulation, numpy for PCA via SVD) to mimic some code I wrote in grad school. That code was in SAS 9.4, using PROC IML to call svd on a matrix of spectra. SAS code:

data Raman1;
infile "Combined SpectraC.csv" dsd firstobs=2;
input Wavenumber R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21;
run;
proc iml;
use Raman1;
read all var {R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 R14 R15 R16 R17 R18 R19 R20 R21} into Raman1[colname=varname]; *Matrix is row=value at a wavenumber, column sample**;
close Raman1;
Raman=Raman1`;  *transpose;
mean1=mean(Raman);
std1=std(Raman);
RamanSVD=(Raman-mean1)/std1;
call svd (u,q,v,RamanSVD);

My Python code (skipping the data intake; spec1 is the same data called Raman1 in the SAS code):

(all abbreviations are normal, pd=pandas, np=numpy, la=numpy.linalg)

import numpy as np
import pandas as pd
from numpy import linalg as la

# spec1 (loaded earlier, not shown): rows = wavenumbers, columns = samples
tspec = spec1.T  # transpose so rows = samples, columns = wavenumbers

# Standardize each column: subtract its mean, divide by its sample std (ddof=1)
scale_spec = (tspec - tspec.mean()) / tspec.std()

PCA_data = scale_spec.to_numpy()

# Note: numpy returns V transposed (vh), so its rows correspond to columns of V
u, s, vh = la.svd(PCA_data)

These produce nearly identical results, except that the sign is reversed on much of the data. Q (from SAS) and S (from Python) are identical, but in U and V, columns 1-6, 8, 13-14, 16, and 19 (out of 21 columns) have the same absolute values but opposite signs between SAS and Python (i.e. values that are positive in SAS are negative in Python and vice versa). The other columns (7, 9-12, 15, 17, 18, 20, and 21) have both the same absolute values and the same signs.
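A column-by-column comparison is one way to pin down exactly which columns differ only in sign. The sketch below is illustrative: `flipped_columns` is a hypothetical helper, the random matrix stands in for the spectra, and the "second library" is simulated by flipping a few columns by hand rather than by using real SAS output.

```python
import numpy as np

def flipped_columns(a, b, tol=1e-8):
    """Return indices j where column a[:, j] equals -b[:, j] (sign-flipped only)."""
    flipped = []
    for j in range(a.shape[1]):
        same = np.allclose(a[:, j], b[:, j], atol=tol)
        negated = np.allclose(a[:, j], -b[:, j], atol=tol)
        if negated and not same:
            flipped.append(j)
    return flipped

# Stand-in data (assumption): 21 samples x 50 wavenumbers
rng = np.random.default_rng(0)
X = rng.normal(size=(21, 50))
u, s, vh = np.linalg.svd(X, full_matrices=False)

# Simulate another library returning some columns with the opposite sign
u_other = u.copy()
u_other[:, [0, 3, 7]] *= -1

print(flipped_columns(u, u_other))  # 0-based column indices
```

With real data, `u_other` would be the U matrix exported from SAS and read back in; note that the indices printed here are 0-based.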

This isn't a problem per se: the general method of interpreting PCA of spectra is to reconstruct spectra using a limited number of dimensions (the principal components), and because the signs are flipped on both sides of the decomposition, the reconstruction is the same. That is, (+X)*(+X) = (-X)*(-X), and (+X)*(-X) = (-X)*(+X) (assuming X is the same number).
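The claim that paired sign flips leave the reconstruction unchanged can be checked numerically. This is a minimal sketch on random stand-in data (an assumption, not the actual spectra); the flipped column indices and the number of retained components `k` are arbitrary choices for illustration.

```python
import numpy as np

# Stand-in data (assumption): 21 samples x 50 wavenumbers
rng = np.random.default_rng(0)
X = rng.normal(size=(21, 50))
u, s, vh = np.linalg.svd(X, full_matrices=False)

# Flip the sign of some columns of U and the matching rows of V^H
signs = np.ones(len(s))
signs[[0, 3, 7]] = -1
u_flipped = u * signs               # scales column j of U by signs[j]
vh_flipped = vh * signs[:, None]    # scales row j of V^H by signs[j]

# Reconstruct from the first k components with each set of factors
k = 5
recon = (u[:, :k] * s[:k]) @ vh[:k, :]
recon_flipped = (u_flipped[:, :k] * s[:k]) @ vh_flipped[:k, :]

print(np.allclose(recon, recon_flipped))
```

The two reconstructions agree because each rank-one term s_j * u_j * v_j^T is unchanged when u_j and v_j are negated together.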

This is a matter of my confidence in the data and a matter of me being able to defend my methods. Can anyone help me understand why these sign changes would have occurred? What is different about these processes?

PS: I did check. The data used are the exact same files.

Michael Baudin · Accepted Answer

Mathematically, this happens because if x is an eigenvector, then -x is also an eigenvector with the same eigenvalue. Hence, the direction of an eigenvector is undetermined. The same holds for the SVD: negating a column of U together with the matching column of V leaves the decomposition unchanged. Numerically, a small rounding error during the computation can make the direction change. It would be much simpler if the sign convention had been standardized across libraries, but this is not the present situation, so we must live with it. However, you can arbitrarily fix the direction as a post-processing step after any PCA routine, e.g. by making the first component nonnegative: this is a legitimate fix for a lack of standardization.
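Such a post-processing step might look like the sketch below. It uses a slightly different convention from the one mentioned above: the largest-magnitude entry of each U column is made positive, on the assumption that this is less fragile than the first entry, which can be close to zero. `normalize_signs` is a hypothetical helper, and the data and the second decomposition are simulated.

```python
import numpy as np

def normalize_signs(u, vh):
    """Flip each (U column, V^H row) pair so that the largest-magnitude
    entry of every U column is positive."""
    idx = np.argmax(np.abs(u), axis=0)              # row of largest |entry| per column
    signs = np.sign(u[idx, np.arange(u.shape[1])])  # sign of that entry
    signs[signs == 0] = 1.0                         # guard against an all-zero column
    return u * signs, vh * signs[:, None]

# Stand-in data (assumption): 21 samples x 50 wavenumbers
rng = np.random.default_rng(0)
X = rng.normal(size=(21, 50))
u, s, vh = np.linalg.svd(X, full_matrices=False)

# Simulate a second library that returns some pairs with the opposite sign
flip = np.ones(len(s))
flip[[0, 3, 7]] = -1
u_other, vh_other = u * flip, vh * flip[:, None]

# After normalization both decompositions agree
u1, vh1 = normalize_signs(u, vh)
u2, vh2 = normalize_signs(u_other, vh_other)
print(np.allclose(u1, u2) and np.allclose(vh1, vh2))
```

Applying the same convention to the SAS output (after exporting U and V) should make the two sets of factors directly comparable, without changing any reconstruction.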
