Reputation: 2425
I am using PCA to reduce an Nx3 array to an Nx2 array. This is mainly because the PCA transformation (an Nx2 matrix) is invariant to rotations and translations applied to the original Nx3 array. Let's take the following as an example.
import numpy as np
from sklearn.decomposition import PCA

a = np.array([[0.5  , 0.5  , 0.5  ],
              [0.332, 0.456, 0.751],
              [0.224, 0.349, 0.349],
              [0.112, 0.314, 0.427]])
pca = PCA(n_components=2, svd_solver='full', random_state=10)
print(pca.fit_transform(a))
The following is the output. Note that we get the same output with print(pca.fit_transform(a - L)), L being any number, due to translational invariance. The same holds for rotations.
[[ 0.16752654  0.15593431]
 [ 0.20568992 -0.14688601]
 [-0.16899598  0.06364857]
 [-0.20422047 -0.07269687]]
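For instance, a quick sanity check of the translation claim (the shift L = 0.37 below is arbitrary); it holds because PCA centers the data before projecting:
t = pca.fit_transform(a)
t_shifted = pca.fit_transform(a - 0.37)  # shift every coordinate by L = 0.37
print(np.allclose(t, t_shifted))         # True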
Now, I apply a very small perturbation (~1%) to the array a and perform PCA again.
a_p = np.array([[0.51 , 0.53 , 0.52 ],
                [0.322, 0.452, 0.741],
                [0.217, 0.342, 0.339],
                [0.116, 0.31 , 0.417]])
pca = PCA(n_components=2, svd_solver='full', random_state=10)
print(pca.fit_transform(a_p))
The result is as follows. This is quite different from the PCA of the original array.
[[-0.2056024  -0.14346977]
 [-0.18563578  0.15627932]
 [ 0.17974942 -0.07001969]
 [ 0.21148876  0.05721014]]
I expected the PCA transformation of the perturbed array to be very similar to that of the original array, but the percentage change is huge. Why is this? Is there any way I can obtain a very similar PCA transformation for a slightly perturbed/shaken array?
I understand that I can get a similar result by performing only the transform operation in the second case (e.g. pca.transform(a_p)); however, in that case I lose rotational and translational invariance w.r.t. a_p.
The question originally comes from crystallography. My requirement is that the PCA (or whatever) transformation should not change significantly under a small change in the input, while remaining invariant to rotations and translations of the input. Can anyone explain the above, or suggest an alternative method that serves my purpose?
Upvotes: 2
Views: 785
Reputation: 3467
You are getting the same principal components, but with their signs flipped.
See the following code. I just kept the two PCA instances, pca1 and pca2, to access their components_ attribute afterwards:
import numpy as np
from sklearn.decomposition import PCA

a = np.array([[0.5  , 0.5  , 0.5  ],
              [0.332, 0.456, 0.751],
              [0.224, 0.349, 0.349],
              [0.112, 0.314, 0.427]])
pca1 = PCA(n_components=2, svd_solver='full', random_state=10)
print(pca1.fit_transform(a))

a_p = np.array([[0.51 , 0.53 , 0.52 ],
                [0.322, 0.452, 0.741],
                [0.217, 0.342, 0.339],
                [0.116, 0.31 , 0.417]])
pca2 = PCA(n_components=2, svd_solver='full', random_state=10)
print(pca2.fit_transform(a_p))
pca1.components_
array([[ 0.64935364,  0.38718276,  0.65454515],
       [ 0.63947417,  0.18783695, -0.74551329]])

pca2.components_
array([[-0.65743254, -0.42817638, -0.62003826],
       [-0.59052329, -0.21834821,  0.77692104]])
As you can see, the PCs point in similar directions, but with opposite signs. For example, the PC1 for pca1 is [ 0.64935364, 0.38718276, 0.65454515] while the PC1 for pca2 is [-0.65743254, -0.42817638, -0.62003826]. Ignoring the signs, the coordinate-wise differences are relatively small: around 2%, 10%, and 5% according to my calculations. This matches your intuition that "they should be relatively close".
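You can check those differences directly (a sketch; pca2's first PC is negated before comparing so the signs match):
pc1_orig = pca1.components_[0]
pc1_pert = -pca2.components_[0]  # undo the sign flip before comparing
print(np.abs(pc1_pert - pc1_orig) / np.abs(pc1_orig) * 100)
# coordinate-wise relative differences of a few percent each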
The key insight here is that the vector [-0.65743254, -0.42817638, -0.62003826] and the vector [0.65743254, 0.42817638, 0.62003826] lie on the same line in space, just "pointing" in opposite directions. Thus, both are equally valid principal components for PCA to find.
I don't know of any built-in way to force sklearn to produce vectors pointing in the same direction.
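That said, you can impose a sign convention yourself. Below is a minimal sketch (the helper canonical_signs is my own, not an sklearn API): flip each component so that its entry of largest absolute value is positive, and flip the matching score column so the embedding stays consistent:
import numpy as np

def canonical_signs(components, scores):
    # Index of the largest-magnitude entry in each principal component (row)
    idx = np.argmax(np.abs(components), axis=1)
    # Sign of that entry; multiplying by it makes the entry positive
    signs = np.sign(components[np.arange(components.shape[0]), idx])
    # Flip each component and its score column together
    return components * signs[:, None], scores * signs

t1 = pca1.fit_transform(a)
comps1, t1 = canonical_signs(pca1.components_, t1)
t2 = pca2.fit_transform(a_p)
comps2, t2 = canonical_signs(pca2.components_, t2)
# t1 and t2 now agree closely, with no manual choice of which one to negate.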
This sign flip explains most of the distance between your points. The rest is explained by the perturbation you introduced.
A quick solution could be to switch the sign of the results of the PCA transformation of a_p.
A positive aspect of the "sign problem" is that you can switch the signs of the embedded values without any loss of information.
So you would do something like this:
t1 = pca1.fit_transform(a)
t2 = pca2.fit_transform(a_p)
t2 = -t2 # Change signs
t1
array([[ 0.16752654,  0.15593431],
       [ 0.20568992, -0.14688601],
       [-0.16899598,  0.06364857],
       [-0.20422047, -0.07269687]])

t2
array([[ 0.2056024 ,  0.14346977],
       [ 0.18563578, -0.15627932],
       [-0.17974942,  0.07001969],
       [-0.21148876, -0.05721014]])
Where t1 and t2 are roughly similar, as your intuition originally (and correctly) suggested.
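One caveat: negating all of t2 works here because both components happened to flip their signs. In general each component can flip independently, so a safer sketch (assuming the rows of t1 and t2 correspond to the same points) aligns the score columns one by one:
# Flip any column of t2 whose dot product with the matching column of t1
# is negative, aligning each embedded axis with the reference t1.
signs = np.sign(np.sum(t1 * t2, axis=0))
t2_aligned = t2 * signs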
Upvotes: 1