astudentofmaths

Reputation: 1163

PCA sklearn - Which dimension does it take

Does sklearn's PCA treat the columns of the dataframe as the vectors to reduce, or the rows?

Because when doing this:

import pandas as pd
from sklearn.decomposition import PCA

df=pd.DataFrame([[1,-21,45,3,4],[4,5,89,-5,6],[7,-4,58,1,19],[10,11,74,20,12],[13,14,15,45,78]]) #5 rows 5 columns
pca=PCA(n_components=3)
pca.fit(df)
df_pcs=pd.DataFrame(data=pca.components_, index = df.index)

I get the following error:

ValueError: Shape of passed values is (5, 3), indices imply (5, 5)

Upvotes: 0

Views: 1196

Answers (1)

Vivek Kumar

Reputation: 36599

Rows represent samples and columns represent features. PCA reduces the dimensionality of the data, i.e. the number of features. So it reduces the columns.

In terms of vectors: each row is treated as a single feature vector, and PCA reduces the length of that vector.

For example, if you have a dataframe of shape [100, 6] and n_components is set to 3, the transformed output will have shape [100, 3].
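To make the shape change concrete, here is a minimal sketch. The random data, the seed, and the names X and X_reduced are made up for illustration; only the shapes matter.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative random data (an assumption, not from the question):
# 100 samples (rows) and 6 features (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)  # fit, then project every row onto 3 components

print(X.shape)          # (100, 6)
print(X_reduced.shape)  # (100, 3)
```

The number of rows stays the same; only the number of columns drops from 6 to 3.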

# You need this: it projects each sample (row) onto the principal components.
df_pcs=pca.transform(df)

# This raises an error because the shapes don't match.
df_pcs=pd.DataFrame(data=pca.components_, index = df.index)

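pca.components_ is an array of shape [3, 5] (one row per component, one column per original feature), but your index parameter uses df.index, which has 5 entries. The array has 3 rows and the index has 5 labels, hence the error. pca.components_ also represents a completely different thing from the transformed data.

According to the documentation: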

components_ : array, [n_components, n_features]

Principal axes in feature space, representing the 
directions of maximum variance in the data.
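If you do want both results as dataframes, something like the following should work. This is a sketch, not the only way to do it: the labels PC1..PC3 and the names df_scores and df_axes are my own, not anything sklearn provides.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Same data as in the question: 5 samples (rows), 5 features (columns).
df = pd.DataFrame([[1, -21, 45, 3, 4],
                   [4, 5, 89, -5, 6],
                   [7, -4, 58, 1, 19],
                   [10, 11, 74, 20, 12],
                   [13, 14, 15, 45, 78]])

pca = PCA(n_components=3)
scores = pca.fit_transform(df)  # shape (5, 3): one row per sample

# Hypothetical labels for the components, chosen here for readability.
pc_names = ["PC1", "PC2", "PC3"]

# Transformed samples: rows line up with df.index, columns are components.
df_scores = pd.DataFrame(scores, index=df.index, columns=pc_names)

# Principal axes: rows are components, columns line up with df.columns.
df_axes = pd.DataFrame(pca.components_, index=pc_names, columns=df.columns)

print(df_scores.shape)  # (5, 3)
print(df_axes.shape)    # (3, 5)
```

So df.index belongs with the output of transform, while pca.components_ needs one index label per component and one column per original feature.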

Upvotes: 3
