Jon

Reputation: 12864

Implementing a PCA (Eigenvector based) in Python

I am trying to implement PCA in Python. My goal is a version that behaves like Matlab's PCA implementation. However, I think I am missing a crucial point, because some of my test results come out with the wrong sign (+/-).

Can you find a mistake in the algorithm? Why are the signs sometimes different?

An implementation of PCA based on eigenvectors:

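import numpy as np

def pca(A, new_array_rank=4):
    # Center each column (feature) on its mean
    A = A - np.mean(A, axis=0)

    # Covariance of the features; np.cov treats rows as variables, hence A.T
    covariance_matrix = np.cov(A.T)

    # Eigendecomposition, then sort by descending eigenvalue
    eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix)
    new_index = np.argsort(eigen_values)[::-1]
    eigen_vectors = eigen_vectors[:, new_index]
    eigen_values = eigen_values[new_index]

    # Keep the leading components and project the centered data onto them
    eigen_vectors = eigen_vectors[:, :new_array_rank]
    return np.dot(eigen_vectors.T, A.T).T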

My test values:

array([[ 0.13298325,  0.2896928 ,  0.53589224,  0.58164269,  0.66202221,
     0.95414116,  0.03040784,  0.26290471,  0.40823539,  0.37783385],
   [ 0.90521267,  0.86275498,  0.52696221,  0.15243867,  0.20894357,
     0.19900414,  0.50607341,  0.53995902,  0.32014539,  0.98744942],
   [ 0.87689087,  0.04307512,  0.45065793,  0.29415066,  0.04908066,
     0.98635538,  0.52091338,  0.76291385,  0.97213094,  0.48815925],
   [ 0.75136801,  0.85946751,  0.10508436,  0.04656418,  0.08164919,
     0.88129981,  0.39666754,  0.86325704,  0.56718669,  0.76346602],
   [ 0.93319721,  0.5897521 ,  0.75065047,  0.63916306,  0.78810679,
     0.92909485,  0.23751963,  0.87552313,  0.37663086,  0.69010429],
   [ 0.53189229,  0.68984247,  0.46164066,  0.29953259,  0.10826334,
     0.47944168,  0.93935082,  0.40331874,  0.18541041,  0.35594587],
   [ 0.36399075,  0.00698617,  0.61030608,  0.51136309,  0.54185601,
     0.81383604,  0.50003674,  0.75414875,  0.54689801,  0.9957493 ],
   [ 0.27815017,  0.65417397,  0.57207255,  0.54388744,  0.89128334,
     0.3512483 ,  0.94441934,  0.05305929,  0.77389942,  0.93125228],
   [ 0.80409485,  0.2749575 ,  0.22270875,  0.91869706,  0.54683128,
     0.61501493,  0.7830902 ,  0.72055598,  0.09363186,  0.05103846],
   [ 0.12357816,  0.29758902,  0.87807485,  0.94348706,  0.60896429,
     0.33899019,  0.36310027,  0.02380186,  0.67207071,  0.28638936]])

My PCA result using eigenvectors:

array([[  5.09548931e-01,  -3.97079651e-01,  -1.47555867e-01,
     -3.55343967e-02,  -4.92125732e-01,  -1.78191399e-01,
     -3.29543974e-02,   3.71406504e-03,   1.06404170e-01,
     -1.66533454e-16],
   [ -5.15879041e-01,   6.40833419e-01,  -7.54601587e-02,
     -2.00776798e-01,  -7.07247669e-02,   2.68582368e-01,
     -1.66124362e-01,   1.03414828e-01,   7.76738500e-02,
      5.55111512e-17],
   [ -4.42659342e-01,  -5.13297786e-01,  -1.65477203e-01,
      5.33670847e-01,   2.00194213e-01,   2.06176265e-01,
      1.31558875e-01,  -2.81699724e-02,   6.19571305e-02,
     -8.32667268e-17],
   [ -8.50397468e-01,   5.14319846e-02,  -1.46289906e-01,
      6.51133920e-02,  -2.83887201e-01,  -1.90516618e-01,
      1.45748370e-01,   9.49464768e-02,  -1.05989648e-01,
      4.16333634e-17],
   [ -1.61040296e-01,  -3.47929944e-01,  -1.19871598e-01,
     -6.48965493e-01,   7.53188055e-02,   1.31730340e-01,
      1.33229858e-01,  -1.43587499e-01,  -2.20913989e-02,
     -3.40005801e-16],
   [ -1.70017435e-01,   4.22573148e-01,   4.81511942e-01,
      2.42170125e-01,  -1.18575764e-01,  -6.87250591e-02,
     -1.20660307e-01,  -2.22865482e-01,  -1.73666882e-02,
     -1.52655666e-16],
   [  6.90841779e-02,  -2.86233901e-01,  -4.16612350e-01,
      9.38935057e-03,   3.02325120e-01,  -1.61783482e-01,
     -3.55465509e-01,   1.15323059e-02,  -5.04619674e-02,
      4.71844785e-16],
   [  5.26189089e-01,   6.81324113e-01,  -2.89960115e-01,
      2.01781673e-02,   3.03159463e-01,  -2.11777986e-01,
      2.25937548e-01,  -5.49219872e-05,   3.66268329e-02,
     -1.11022302e-16],
   [  6.68680313e-02,  -2.99715813e-01,   8.53428694e-01,
     -1.30066853e-01,   2.31410283e-01,  -1.02860624e-01,
      1.95449586e-02,   1.30218425e-01,   1.68059569e-02,
      2.22044605e-16],
   [  9.68303353e-01,   4.80944309e-02,   2.62865615e-02,
      1.44821658e-01,  -1.47094421e-01,   3.07366196e-01,
      1.91849667e-02,   5.08517759e-02,  -1.03558238e-01,
      1.38777878e-16]])

Test result of the same data using Matlab's PCA function:

array([[ -5.09548931e-01,   3.97079651e-01,   1.47555867e-01,
      3.55343967e-02,  -4.92125732e-01,  -1.78191399e-01,
     -3.29543974e-02,  -3.71406504e-03,  -1.06404170e-01,
     -0.00000000e+00],
   [  5.15879041e-01,  -6.40833419e-01,   7.54601587e-02,
      2.00776798e-01,  -7.07247669e-02,   2.68582368e-01,
     -1.66124362e-01,  -1.03414828e-01,  -7.76738500e-02,
     -0.00000000e+00],
   [  4.42659342e-01,   5.13297786e-01,   1.65477203e-01,
     -5.33670847e-01,   2.00194213e-01,   2.06176265e-01,
      1.31558875e-01,   2.81699724e-02,  -6.19571305e-02,
     -0.00000000e+00],
   [  8.50397468e-01,  -5.14319846e-02,   1.46289906e-01,
     -6.51133920e-02,  -2.83887201e-01,  -1.90516618e-01,
      1.45748370e-01,  -9.49464768e-02,   1.05989648e-01,
     -0.00000000e+00],
   [  1.61040296e-01,   3.47929944e-01,   1.19871598e-01,
      6.48965493e-01,   7.53188055e-02,   1.31730340e-01,
      1.33229858e-01,   1.43587499e-01,   2.20913989e-02,
     -0.00000000e+00],
   [  1.70017435e-01,  -4.22573148e-01,  -4.81511942e-01,
     -2.42170125e-01,  -1.18575764e-01,  -6.87250591e-02,
     -1.20660307e-01,   2.22865482e-01,   1.73666882e-02,
     -0.00000000e+00],
   [ -6.90841779e-02,   2.86233901e-01,   4.16612350e-01,
     -9.38935057e-03,   3.02325120e-01,  -1.61783482e-01,
     -3.55465509e-01,  -1.15323059e-02,   5.04619674e-02,
     -0.00000000e+00],
   [ -5.26189089e-01,  -6.81324113e-01,   2.89960115e-01,
     -2.01781673e-02,   3.03159463e-01,  -2.11777986e-01,
      2.25937548e-01,   5.49219872e-05,  -3.66268329e-02,
     -0.00000000e+00],
   [ -6.68680313e-02,   2.99715813e-01,  -8.53428694e-01,
      1.30066853e-01,   2.31410283e-01,  -1.02860624e-01,
      1.95449586e-02,  -1.30218425e-01,  -1.68059569e-02,
     -0.00000000e+00],
   [ -9.68303353e-01,  -4.80944309e-02,  -2.62865615e-02,
     -1.44821658e-01,  -1.47094421e-01,   3.07366196e-01,
      1.91849667e-02,  -5.08517759e-02,   1.03558238e-01,
     -0.00000000e+00]])

Upvotes: 3

Views: 1536

Answers (1)

Josef

Reputation: 22897

The sign of an eigenvector, like other normalization choices, is arbitrary: if v is an eigenvector, then so is -v. Matlab and numpy normalize the eigenvectors in the same way (to unit length), but the sign can depend on details of the linear algebra library that is used.
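
To make the point about arbitrary signs concrete, here is a small sketch. The random data, the seed, and the use of `np.linalg.eigh` (rather than `eig`) are my own choices for illustration, not part of the question. It checks that both v and -v satisfy the eigenvector equation, and that flipping an eigenvector only flips the sign of the corresponding column of projected data.

```python
import numpy as np

# Illustrative data only (assumption): 10 samples, 3 features
rng = np.random.default_rng(0)
X = rng.random((10, 3))
X = X - X.mean(axis=0)

C = np.cov(X.T)
# eigh returns eigenvalues in ascending order for symmetric matrices
vals, vecs = np.linalg.eigh(C)
v, lam = vecs[:, -1], vals[-1]

# Both v and -v satisfy C v = lambda v
ok_pos = np.allclose(C @ v, lam * v)
ok_neg = np.allclose(C @ (-v), lam * (-v))

# Flipping the eigenvector flips the sign of the projected scores
scores_flip = np.allclose(X @ (-v), -(X @ v))
```

So two correct implementations can return projections whose columns differ by a factor of -1.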

When I wrote the numpy equivalent of Matlab's princomp, I simply normalized the signs of the eigenvectors before comparing them with Matlab's in my unit tests.
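
A minimal sketch of what such a sign normalization could look like. The helper name `normalize_signs` and the convention it uses (flip each column so that its largest-magnitude entry is positive) are assumptions for illustration; any fixed rule applied to both results would work equally well in a test.

```python
import numpy as np

def normalize_signs(vectors):
    """Flip each column so that its largest-magnitude entry is positive."""
    vectors = np.asarray(vectors, dtype=float)
    # Row index of the largest-magnitude entry in each column
    idx = np.argmax(np.abs(vectors), axis=0)
    # Sign of that entry, one value per column
    signs = np.sign(vectors[idx, np.arange(vectors.shape[1])])
    signs[signs == 0] = 1.0  # leave all-zero columns untouched
    return vectors * signs

# Two sets of column vectors that differ only in the sign of the first column
v = np.array([[0.6, -0.8],
              [0.8,  0.6]])
w = v * np.array([-1.0, 1.0])

same = np.allclose(normalize_signs(v), normalize_signs(w))
```

Applying the same helper to the columns of both the numpy and the Matlab output before comparing them removes the sign ambiguity from the test.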

Upvotes: 3
