Jack Rolph

Reputation: 595

Sklearn PCA: Correct Dimensionality of PCs

I have a dataframe, df, with a column called 'event' in which each entry is a 24x24x40 numpy array. I want to flatten each array and run PCA on the resulting vectors.

However, the PCA produces eigenvectors with the dimensions of 'the number of entries', not the 'number of dimensions in the data'.

To illustrate my problem, I demonstrate a minimal example that works perfectly well:

EXAMPLE 1

from sklearn import datasets, decomposition

digits = datasets.load_digits()
X = digits.data

pca = decomposition.PCA()
X_pca = pca.fit_transform(X)

print (X.shape)
Result: (1797, 64)

print (X_pca.shape)
Result: (1797, 64)

There are 1797 entries in each case, with eigenvectors of dimension 64.

Now onto my example:

EXAMPLE 2

 from sklearn import datasets, decomposition
 import numpy as np
 import pandas as pd
 hdf=pd.HDFStore('./afile.h5')
 df=hdf.select('batch0')

 print(df['event'][0].shape)
 Result: (1, 24, 24, 40)

 print(df['event'][0].flatten().shape)
 Result: (23040,)

 # Flatten every event into a 1-D vector and collect the vectors as rows
 _list = []
 for index, row in df.iterrows():
        entry = row['event'].flatten()
        _list.append(entry)


 X = np.asarray(_list)
 pca = decomposition.PCA()
 X_pca=pca.fit_transform(X)

 print (X.shape)
 Result: (201, 23040)
 print (X_pca.shape)
 Result:(201, 201)

The second dimension of the result is 201, the number of entries, not 23040, the number of features!

I am unfamiliar with dataframes, so it could be that I am iterating through the dataframe incorrectly. However, I have checked that the rows of the resultant numpy array in X in Example 2 can be reshaped and plotted as expected.

Any thoughts would be appreciated!

Kind regards!

Upvotes: 0

Views: 126

Answers (1)

Gustavo Fonseca

Reputation: 651

Sklearn's documentation states that the number of components retained when you don't specify the n_components parameter is min(n_samples, n_features).

Now, heading to your example:

In your first example, the number of data samples (1797) is greater than the number of features (64), so min(n_samples, n_features) = 64 and PCA keeps the full dimensionality (since you are not specifying the number of components). In your second example, however, the number of data samples (201) is far less than the number of features (23040), so sklearn's PCA reduces the number of components to n_samples = 201.
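
To make this concrete, here is a minimal sketch using synthetic random data as a stand-in for your file (the smaller 6x6x10 event shape and the variable names are assumptions chosen so it runs quickly, not your actual data). It also shows that the eigenvectors themselves live in pca.components_, whose rows have the full feature length, whereas the output of fit_transform is the data projected onto those components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Assumed stand-in for the real data: 201 events, each flattened to a vector.
# A 6x6x10 shape (360 features) replaces 24x24x40 purely to keep this fast.
n_samples, event_shape = 201, (6, 6, 10)
n_features = int(np.prod(event_shape))
X = rng.normal(size=(n_samples, n_features))

pca = PCA()  # n_components left unset -> min(n_samples, n_features)
X_pca = pca.fit_transform(X)

print(X.shape)                # (n_samples, n_features)
print(X_pca.shape)            # projected data: (n_samples, n_components)
print(pca.components_.shape)  # eigenvectors:   (n_components, n_features)

# Each eigenvector has the full feature length, so it can be reshaped
# back into the original event shape for plotting.
first_pc = pca.components_[0].reshape(event_shape)

# Tall case, like the digits example: more samples than features.
X_tall = rng.normal(size=(500, 64))
pca_tall = PCA().fit(X_tall)
print(pca_tall.n_components_)
```

So the (201, 201) result is the 201 entries expressed in 201 component coordinates; the principal components with the dimensionality of the data are the rows of pca.components_.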

Upvotes: 1
