Anonymous Person

Reputation: 1538

PCA and feature correlation

I have a dataset with about 200 numeric columns/features. Taking its corr() gives values very close to 0 (roughly -0.0003 to +0.0003), so plotting its heatmap gives a big black box with a white diagonal. I hope you get the picture; anyway, here it is:

[Correlation heatmap of the ~200 features: a near-uniform dark square with a white diagonal]
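For reference, a rough sketch of how such a heatmap is produced (here df is a placeholder name for the DataFrame holding the 200 features):

import seaborn as sns
import matplotlib.pyplot as plt

# df.corr() computes pairwise Pearson correlations between all numeric columns;
# with values around +/-0.0003 the whole matrix renders almost uniformly dark,
# except the diagonal, which is always exactly 1
sns.heatmap(df.corr(), vmin=-1, vmax=1)
plt.show()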

After this, performing PCA on the dataset doesn't really help either, since there is barely any correlation between any two features. Am I right in assuming that?

Here's the PCA code:

from sklearn.decomposition import PCA

pca = PCA(n_components=0.99)  # keep enough components to retain 99% of the variance
pca.fit(X_scaled)
X_PCA = pca.transform(X_scaled)
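A quick sanity check (a sketch, using the pca fitted above) is to ask the fitted object how many components that threshold actually kept:

# n_components_ is populated after fit() when n_components is given as a fraction
print(pca.n_components_)                    # number of components retained
print(pca.explained_variance_ratio_.sum())  # should be at least 0.99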

And here's the plot to determine the principal components (elbow method):

[Scree plot: explained variance (eigenvalues) of each principal component]

Code for the above:

import seaborn as sns
import matplotlib.pyplot as plt

sns.set(rc={'figure.figsize': (20, 10)})

plt.ylabel('Eigenvalues')
plt.xlabel('Number of components')
plt.title('Elbow method to determine the principal components')
plt.ylim(0, max(pca.explained_variance_))
# reference line halfway between the largest and smallest eigenvalue
plt.axhline(y=(max(pca.explained_variance_) + min(pca.explained_variance_)) / 2, color='r', linestyle='--')
plt.plot(pca.explained_variance_)
plt.show()

What I can determine from the plot is that there is no real elbow, except maybe at PC1; but keeping only one PC would be like discarding 99.5% of the data, so I am assuming all 200 features are necessary.
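For reference, the same information can also be viewed as a cumulative explained-variance ratio (a sketch, using the pca fitted above):

import numpy as np
import matplotlib.pyplot as plt

# cumulative share of variance explained by the first k components
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.axhline(y=0.99, color='r', linestyle='--')  # the 99% threshold used above
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance ratio')
plt.show()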

So my question boils down to this:

  1. Is that the right assumption?
  2. If not, what is an ideal way to deal with scenarios like this (where there are a lot of features and no correlations between most (or all) of them)?
  3. Is the correlation between variables a deciding factor for PCA? I read somewhere it is.

Upvotes: 0

Views: 1024

Answers (1)

Irakli Salia

Reputation: 75

  1. The one thing you can take from this result is that those 200 features are not correlated (unless you forgot to mean-normalise your data, which is a must for PCA; see the scaling sketch after the example below). Whether those 200 features are necessary or not depends on the task you have.

For example: you can have two features that are not correlated at all, say feature_1 is a person's height and feature_2 is today's weather. Those two features are uncorrelated, but if the task is to guess a person's weight, then by common sense the weather is not a necessary feature.
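On the normalisation point: a minimal sketch of centring and scaling before PCA with sklearn's StandardScaler (X here is a placeholder for the raw feature matrix):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# StandardScaler subtracts each feature's mean and divides by its standard
# deviation, so no single feature dominates the covariance matrix
X_scaled = StandardScaler().fit_transform(X)  # X: (n_samples, n_features)
pca = PCA(n_components=0.99).fit(X_scaled)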

The way PCA works is that it first builds a covariance matrix, which holds the covariance between all possible pairs of features (it is a symmetric matrix, since cov(x1, x2) is the same as cov(x2, x1); correlation is just dimensionless, normalised covariance). So, for example, if we have 3 features X1, X2 and X3, we get the covariance matrix:

cov(X1,X1)  cov(X1,X2)  cov(X1,X3)
cov(X2,X1)  cov(X2,X2)  cov(X2,X3)
cov(X3,X1)  cov(X3,X2)  cov(X3,X3)
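In NumPy this matrix is one call (a sketch with three random features; np.cov expects variables as rows by default):

import numpy as np

rng = np.random.default_rng(0)
X1, X2, X3 = rng.normal(size=(3, 100))

# returns the symmetric 3x3 matrix sketched above
cov = np.cov(np.vstack([X1, X2, X3]))
print(cov.shape)  # (3, 3)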

After building the covariance matrix, we compute its eigenvalues and eigenvectors, which give us the explained variance and the directions onto which the original data is projected. If you have time to play around, create a dummy dataset: fill a feature X1 with some random values, then create a linear feature X2 (add some number to X1, or multiply X1 by a constant) and do the same for X3. Run the regular sklearn PCA and you will see that the explained variance ratio with n_components=1 is 1, which is exactly what we set up when generating X1, X2 and X3 (X2 and X3 are fully correlated with X1). If the above is confusing, I've included code for it at the end.
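To make the eigenvalue step concrete, here is a quick sketch checking that the eigenvalues of the covariance matrix match sklearn's explained_variance_ on some random data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# eigenvalues of the covariance matrix (np.cov also divides by n - 1) ...
eigvals = np.linalg.eigh(np.cov(X, rowvar=False))[0]

# ... equal PCA's explained variance, just sorted ascending instead of descending
pca = PCA().fit(X)
print(eigvals[::-1])
print(pca.explained_variance_)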

  2. Depends on what you are trying to do: do you want to reduce the dimensionality of the data, or are you planning to feed these features to some model?

  3. As explained in point 1 above, yes it is; try the code below.


import numpy as np
from sklearn.decomposition import PCA

X1 = np.random.normal(0, 1, 100)  # random base feature
X2 = X1 + 5                       # linear function of X1, so perfectly correlated
X3 = X1 * 18                      # another linear function of X1
X = np.vstack([X1, X2, X3]).T     # shape (100, 3): samples x features

pca = PCA(n_components=1)
pca.fit_transform(X)

print("Explained variance ratio is", pca.explained_variance_ratio_[0])  # prints 1.0


Upvotes: 1
