anon_swe

Reputation: 9335

SciKit-Learn: Basic PCA Confusion

I'm trying to use SciKit-Learn to perform PCA on my dataset. I currently have 2,208 rows and 53,741 columns (features). So I want to use PCA to reduce the dimensionality of this dataset.

I'm following Hands-On Machine Learning with SciKit-Learn and TensorFlow:

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

As far as I understand, this should reduce the number of columns such that they, in total, explain 95% of the variance in my dataset.
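For reference, the fitted estimator exposes this directly as well; a minimal check on the pca object fitted above (assuming I'm reading the attributes right):

print(pca.n_components_)                    # number of components actually kept
print(pca.explained_variance_ratio_.sum())  # total variance those components explain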

Now I want to see how many features (columns) are left in X_reduced:

X_reduced.shape
(2208, 1)

So it looks like a single feature accounts for at least 95% of the variance in my dataset...

1) This is very surprising, so I looked at how much the most important dimension contributes variance-wise:

pca = PCA(n_components = 1)
X2D = pca.fit_transform(X)
print(pca.explained_variance_ratio_)

[ 0.98544046]

So it's 98.5%!

How do I figure out what this seemingly magical dimension is?

2) Don't I need to include my target Y values when doing PCA?

Thanks!

Upvotes: 2

Views: 859

Answers (1)

Ryan Stout

Reputation: 1028

This "seemingly magical dimension" is actually a linear combination of all your dimensions. PCA works by changing basis from your original column space to the space spanned by the eigenvectors of your data's covariance matrix. You don't need the Y-values because PCA only needs the eigenvalues and eigenvectors of your data's covariance matrix.

Upvotes: 2
