Sklearn PCA: Correct Dimensionality of PCs

Question

I have a dataframe, df, with a column called 'event', where each entry holds a 24x24x40 numpy array. For each entry, I want to:

extract this numpy array;
flatten it into a 1x23040 vector;
add this entry as a row in a new numpy array or dataframe;
perform PCA on the resulting matrix.

However, the PCA output has as many columns as there are entries (samples), not as many as there are dimensions (features) in the data.

To illustrate my problem, I demonstrate a minimal example that works perfectly well:

EXAMPLE 1

from sklearn import datasets, decomposition

digits = datasets.load_digits()
X = digits.data

pca = decomposition.PCA()
X_pca = pca.fit_transform(X)

print (X.shape)
Result: (1797, 64)

print (X_pca.shape)
Result: (1797, 64)

There are 1797 entries in each case, with eigenvectors of dimension 64.

Now onto my example:

EXAMPLE 2

 import numpy as np
 import pandas as pd
 from sklearn import decomposition

 hdf = pd.HDFStore('./afile.h5')
 df = hdf.select('batch0')

 print(df['event'][0].shape)
 Result: (1, 24, 24, 40)

 print(df['event'][0].flatten().shape)
 Result: (23040,)

 # collect one flattened event per dataframe row
 _list = []
 for index, row in df.iterrows():
     entry = df['event'][index].flatten()
     _list.append(entry)

 X = np.asarray(_list)
 pca = decomposition.PCA()
 X_pca = pca.fit_transform(X)

 print(X.shape)
 Result: (201, 23040)
 print(X_pca.shape)
 Result: (201, 201)

The second dimension now equals the number of entries (201), not the number of features (23040)!

I am unfamiliar with dataframes, so it could be that I am iterating through the dataframe incorrectly. However, I have checked that the rows of the resultant numpy array in X in Example 2 can be reshaped and plotted as expected.

Any thoughts would be appreciated!

Kind regards!

Gustavo Fonseca · Accepted Answer

Sklearn's documentation states that the number of components retained when you don't specify the n_components parameter is min(n_samples, n_features).
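
To see that rule in isolation, here is a small sketch on random data. The array sizes and the helper name fitted_shape are made up for illustration and have nothing to do with your file; only PCA, fit_transform and n_components_ are sklearn's own.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def fitted_shape(n_samples, n_features):
    """Fit a default PCA on random data (hypothetical helper for this demo)."""
    X = rng.normal(size=(n_samples, n_features))
    pca = PCA()  # n_components left unset
    X_pca = pca.fit_transform(X)
    return X_pca.shape, pca.n_components_

# More samples than features: every feature dimension is kept.
tall_shape, tall_k = fitted_shape(300, 10)
print(tall_shape, tall_k)  # (300, 10) 10

# Fewer samples than features: capped at the number of samples.
wide_shape, wide_k = fitted_shape(20, 500)
print(wide_shape, wide_k)  # (20, 20) 20
```

The second call mirrors your 201 x 23040 case on a smaller scale.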

Now, heading to your example:

In your first example, the number of data samples (1797) is greater than the number of features (64), so all 64 dimensions are kept (since you are not specifying the number of components). In your second example, however, the number of samples (201) is far smaller than the number of features (23040), so sklearn's PCA reduces the number of dimensions to n_samples.
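
It may also help to note that X_pca holds the projected data, while the principal axes themselves are stored in pca.components_ and still have one entry per feature. The sketch below uses a random 20 x 500 array as a stand-in for your 201 x 23040 matrix (an assumption, since I don't have your file) to check both that and the fact that nothing is lost by keeping only n_samples components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))  # stand-in for the 201 x 23040 matrix

pca = PCA()
X_pca = pca.fit_transform(X)

# Projected data: one row per sample, one column per retained component.
print(X_pca.shape)            # (20, 20)

# Principal axes: one row per component, one column per original feature.
print(pca.components_.shape)  # (20, 500)

# Mapping the projections back recovers the original data up to
# floating-point error, so the cap at n_samples discards nothing here.
X_back = pca.inverse_transform(X_pca)
print(np.allclose(X, X_back))
```

So any row of pca.components_ from your own fit should reshape back to 24x24x40 in the same way as a row of X.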
