Reputation: 1538
I have a dataset with about 200 columns/features, all numerical. Taking its corr()
gives values very close to 0 (roughly -0.0003 to +0.0003), so plotting its heatmap just gives a big black box with a white diagonal - I hope you get the picture. Anyway, here it is:
[correlation heatmap: near-zero values everywhere except the white diagonal]
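Here is a self-contained sketch that reproduces the same kind of picture (the random, independent columns below are only a stand-in for the real dataset, so all pairwise correlations come out near zero):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in for the real dataset: ~200 independent numerical columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 200)))

corr = df.corr()
plt.figure(figsize=(12, 10))
# Off-diagonal values are near 0 while the diagonal is exactly 1,
# so the heatmap looks like one dark block with a light diagonal
sns.heatmap(corr)
plt.show()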
After this, when I try to perform PCA on the dataset, it doesn't really seem to help, since there's barely any correlation between any two features. Am I right in assuming that?
Here's the PCA code:
from sklearn.decomposition import PCA
pca = PCA(n_components = .99) # 99% of variance (selecting components while retaining 99% of the variability in data)
pca.fit(X_scaled)
X_PCA = pca.transform(X_scaled)
And here's the plot to determine the principal components (elbow method):
[scree plot: explained variance (eigenvalues) of each principal component, with a horizontal reference line]
Code for the above:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(rc = {'figure.figsize': (20, 10)})
plt.ylabel('Eigenvalues')
plt.xlabel('Principal component')
plt.title('Elbow method to determine the principal components')
plt.ylim(0, max(pca.explained_variance_))
# Reference line halfway between the largest and smallest eigenvalue
plt.axhline(y = (max(pca.explained_variance_) + min(pca.explained_variance_))/2, color = 'r', linestyle = '--')
plt.plot(pca.explained_variance_)
plt.show()
What I was able to determine from the plot is that there isn't really a clear elbow from which to pick the principal components, except maybe at PC1. But keeping only one PC would mean discarding something like 99.5% of the data, so I am assuming all 200 features are necessary.
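To make that more concrete, here is a quick check of how many components the 99% threshold actually keeps and of the cumulative explained variance (a small sketch that assumes the pca object fitted in the snippet above):

import numpy as np

# pca is assumed to be the PCA(n_components=.99) instance fitted on X_scaled above
print("Components kept for 99% of the variance:", pca.n_components_)

cumulative = np.cumsum(pca.explained_variance_ratio_)
# First index where the cumulative ratio reaches 95%, for comparison
print("Components needed for 95% of the variance:", np.searchsorted(cumulative, 0.95) + 1)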
So my question boils down to this:
Upvotes: 0
Views: 1024
Reputation: 75
For example: you can have 2 features that are not correlated at all, say feature_1 is a person's height and feature_2 is today's weather. Those 2 features are uncorrelated, but if our task is to guess a person's weight, then by common sense the weather is not a useful feature while the height is.
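Here is a tiny numeric sketch of that idea (all the numbers below are made up for illustration): height and weather are uncorrelated with each other, yet only height tells you anything about weight:

import numpy as np

rng = np.random.default_rng(42)
height = rng.normal(170, 10, 1000)                     # feature_1: height in cm
weather = rng.normal(20, 5, 1000)                      # feature_2: temperature, independent of height
weight = 0.9 * height - 80 + rng.normal(0, 5, 1000)    # target depends only on height

print("corr(height, weather):", np.corrcoef(height, weather)[0, 1])  # close to 0
print("corr(height, weight): ", np.corrcoef(height, weight)[0, 1])   # strongly positive
print("corr(weather, weight):", np.corrcoef(weather, weight)[0, 1])  # close to 0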
The way PCA works is that it first builds a covariance matrix, which holds the covariance between every possible pair of features (it is a symmetric matrix, since cov(x1, x2) is the same as cov(x2, x1)). So, for example, if we have 3 features X1, X2 and X3, we get the covariance matrix:

Cov(X1,X1)  Cov(X1,X2)  Cov(X1,X3)
Cov(X2,X1)  Cov(X2,X2)  Cov(X2,X3)
Cov(X3,X1)  Cov(X3,X2)  Cov(X3,X3)
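As a quick sketch of that step (random data, just to show the shape), numpy's np.cov produces exactly this kind of symmetric matrix:

import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(size=100)
X2 = rng.normal(size=100)
X3 = rng.normal(size=100)

# Each column is a feature, so rowvar=False gives a 3x3 symmetric covariance matrix
cov_matrix = np.cov(np.vstack([X1, X2, X3]).T, rowvar=False)
print(cov_matrix)
print("Symmetric:", np.allclose(cov_matrix, cov_matrix.T))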
After building the covariance matrix we compute its eigenvalues and eigenvectors, which give us the explained variance and the directions onto which we project the original data. If you have time to play around, create a dummy feature with some random values and call it X1. Then create a linear feature X2 (add some number to X1 or multiply X1 by some constant) and do the same for X3. Run the regular sklearn PCA and you will see that the explained variance ratio with n_components=1 is 1, which reflects exactly what we set up when generating X1, X2 and X3 (X2 and X3 are perfectly correlated with X1). If that sounds confusing, I've included code for doing this at the end.
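As a side note, here is a short sketch (with synthetic data) showing that the eigenvalues of that covariance matrix are, up to numerical precision, what sklearn reports as explained_variance_:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))

# Eigenvalues of the sample covariance matrix, sorted from largest to smallest
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

pca = PCA().fit(X)
print("Eigenvalues:        ", eigenvalues)
print("Explained variance: ", pca.explained_variance_)
print("Match:", np.allclose(eigenvalues, pca.explained_variance_))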
It depends on what you are trying to do. Do you want to reduce the dimensionality of the data, or are you planning on using these features for some model?
As stated in the first answer, yes it is; try the code below.
import numpy as np
from sklearn.decomposition import PCA

# Base feature: 100 random values
X1 = np.random.normal(0, 1, 100)
# X2 and X3 are exact linear functions of X1, so all three features are perfectly correlated
X2 = X1 + 5
X3 = X1 * 18
# Stack into a (100, 3) feature matrix
X = np.vstack([X1, X2, X3]).T

# A single principal component captures all of the variance
pca = PCA(n_components=1)
pca.fit_transform(X)
print("Explained variance ratio is ", pca.explained_variance_ratio_[0])  # prints 1.0 (up to floating point)
EDIT: To be precise, the covariance matrix contains the covariance between features, not the correlation. Correlation is just a dimensionless (normalized) covariance, so the main point of the answer stays the same.
Upvotes: 1