Anonymous Person

Reputation: 1538

PCA and feature correlation

I have a dataset with about 200 numeric columns/features. Taking its corr() gives values very close to 0 (roughly -0.0003 to +0.0003), so plotting its heatmap gives a big black box with a white diagonal. I hope you get the picture; anyway, here it is:

[Correlation heatmap of the ~200 features: a near-uniform dark square with a white diagonal]
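For reference, a rough sketch of how such a heatmap is produced (here df is a placeholder name for the DataFrame holding the 200 features):

import seaborn as sns
import matplotlib.pyplot as plt

# df.corr() computes pairwise Pearson correlations between all numeric columns;
# with values around +/-0.0003 the whole matrix renders almost uniformly dark,
# except the diagonal, which is always exactly 1
sns.heatmap(df.corr(), vmin=-1, vmax=1)
plt.show()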

After this, performing PCA on the dataset doesn't really help either, since there is barely any correlation between any two features. Am I right in assuming that?

Here's the PCA code:

from sklearn.decomposition import PCA

pca = PCA(n_components=0.99)  # keep enough components to retain 99% of the variance
pca.fit(X_scaled)
X_PCA = pca.transform(X_scaled)
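A quick sanity check (a sketch, using the pca fitted above) is to ask the fitted object how many components that threshold actually kept:

# n_components_ is populated after fit() when n_components is given as a fraction
print(pca.n_components_)                    # number of components retained
print(pca.explained_variance_ratio_.sum())  # should be at least 0.99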

And here's the plot to determine the principal components (elbow method):

[Scree plot: explained variance (eigenvalues) of each principal component]

Code for the above:

import seaborn as sns
import matplotlib.pyplot as plt

sns.set(rc={'figure.figsize': (20, 10)})

plt.ylabel('Eigenvalues')
plt.xlabel('Number of components')
plt.title('Elbow method to determine the principal components')
plt.ylim(0, max(pca.explained_variance_))
# reference line halfway between the largest and smallest eigenvalue
plt.axhline(y=(max(pca.explained_variance_) + min(pca.explained_variance_)) / 2, color='r', linestyle='--')
plt.plot(pca.explained_variance_)
plt.show()

What I can determine from the plot is that there is no real elbow, except maybe at PC1; but keeping only one PC would be like discarding 99.5% of the data, so I am assuming all 200 features are necessary.
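For reference, the same information can also be viewed as a cumulative explained-variance ratio (a sketch, using the pca fitted above):

import numpy as np
import matplotlib.pyplot as plt

# cumulative share of variance explained by the first k components
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.axhline(y=0.99, color='r', linestyle='--')  # the 99% threshold used above
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance ratio')
plt.show()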

So my question boils down to this:

  1. Is that the right assumption?
  2. If not, what is an ideal way to deal with scenarios like this (where there are a lot of features and no correlations between most (or all) of them)?
  3. Is the correlation between variables a deciding factor for PCA? I read somewhere it is.

Upvotes: 0

Views: 1024

Answers (1)

Irakli Salia

Reputation: 75

  1. The one thing you can take from this result is that those 200 features are not correlated (unless you forgot to mean-normalise your data, which is a must for PCA; see the scaling sketch after the example below). Whether those 200 features are necessary or not depends on the task you have.

For example: you can have two features that are not correlated at all, say feature_1 is a person's height and feature_2 is today's weather. Those two features are uncorrelated, but if the task is to guess a person's weight, then by common sense the weather is not a necessary feature.
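On the normalisation point: a minimal sketch of centring and scaling before PCA with sklearn's StandardScaler (X here is a placeholder for the raw feature matrix):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# StandardScaler subtracts each feature's mean and divides by its standard
# deviation, so no single feature dominates the covariance matrix
X_scaled = StandardScaler().fit_transform(X)  # X: (n_samples, n_features)
pca = PCA(n_components=0.99).fit(X_scaled)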

The way PCA works is that it first builds a covariance matrix, which holds the covariance between all possible pairs of features (it is a symmetric matrix, since cov(x1, x2) is the same as cov(x2, x1); correlation is just dimensionless, normalised covariance). So, for example, if we have 3 features X1, X2 and X3, we get the covariance matrix:

cov(X1,X1)  cov(X1,X2)  cov(X1,X3)
cov(X2,X1)  cov(X2,X2)  cov(X2,X3)
cov(X3,X1)  cov(X3,X2)  cov(X3,X3)
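In NumPy this matrix is one call (a sketch with three random features; np.cov expects variables as rows by default):

import numpy as np

rng = np.random.default_rng(0)
X1, X2, X3 = rng.normal(size=(3, 100))

# returns the symmetric 3x3 matrix sketched above
cov = np.cov(np.vstack([X1, X2, X3]))
print(cov.shape)  # (3, 3)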

After building the covariance matrix, we compute its eigenvalues and eigenvectors, which give us the explained variance and the directions onto which the original data is projected. If you have time to play around, create a dummy dataset: fill a feature X1 with some random values, then create a linear feature X2 (add some number to X1, or multiply X1 by a constant) and do the same for X3. Run the regular sklearn PCA and you will see that the explained variance ratio with n_components=1 is 1, which is exactly what we set up when generating X1, X2 and X3 (X2 and X3 are fully correlated with X1). If the above is confusing, I've included code for it at the end.
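To make the eigenvalue step concrete, here is a quick sketch checking that the eigenvalues of the covariance matrix match sklearn's explained_variance_ on some random data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# eigenvalues of the covariance matrix (np.cov also divides by n - 1) ...
eigvals = np.linalg.eigh(np.cov(X, rowvar=False))[0]

# ... equal PCA's explained variance, just sorted ascending instead of descending
pca = PCA().fit(X)
print(eigvals[::-1])
print(pca.explained_variance_)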

  2. Depends on what you are trying to do: do you want to reduce the dimensionality of the data, or are you planning to feed these features to some model?

  3. As explained in point 1 above, yes it is; try the code below.


import numpy as np
from sklearn.decomposition import PCA

X1 = np.random.normal(0, 1, 100)  # random base feature
X2 = X1 + 5                       # linear function of X1, so perfectly correlated
X3 = X1 * 18                      # another linear function of X1
X = np.vstack([X1, X2, X3]).T     # shape (100, 3): samples x features

pca = PCA(n_components=1)
pca.fit_transform(X)

print("Explained variance ratio is", pca.explained_variance_ratio_[0])  # prints 1.0


Upvotes: 1
