Reputation: 425
I have a large dataset (42.9 GB) stored in numpy's compressed npz format. When loaded, the data has
n_samples, n_features = 406762, 26421
I need to perform dimensionality reduction on it, so I am using sklearn's PCA methods. Usually, I do
from sklearn.decomposition import PCA

pca = PCA(n_components=200).fit(x)
x_transformed = pca.transform(x)
Since the data can't be loaded into memory all at once, I am using IncrementalPCA instead, as it supports out-of-core fitting through its partial_fit method.
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=200)
for i in xrange(407):
    partial_x = load("...")  # load the i-th chunk from disk
    ipca.partial_fit(partial_x)
Now, once the model has been fit on the complete data, how do I perform the transform? transform appears to take the entire dataset at once, and there is no partial_transform method.
Edit #1:
Once the reduced-dimensional representation of the data is calculated, this is how I'm verifying the reconstruction error:
from sklearn.metrics import mean_squared_error

reconstructed_matrix = pca_model.inverse_transform(reduced_x)
error_curr = mean_squared_error(x, reconstructed_matrix)
How do I calculate this error for the large dataset? Also, is there a way to use partial_fit as part of GridSearchCV or RandomizedSearchCV to find the best n_components?
Upvotes: 5
Views: 2533
Reputation: 4886
You can do it the same way you fit your model: transform doesn't have to be applied to the whole dataset at once.
import numpy as np

x_transform = np.empty((0, 200))  # accumulator for the reduced data
for i in xrange(407):
    partial_x = load("...")  # load the i-th chunk, as during fitting
    partial_x_transform = ipca.transform(partial_x)
    x_transform = np.vstack((x_transform, partial_x_transform))
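Note that np.vstack copies the whole accumulated array on every iteration, which gets slow over 407 chunks. If the total number of rows is known up front (406762 here), a minimal sketch that preallocates the output instead, assuming load returns the chunks in the same order as during fitting:

import numpy as np

n_samples, n_components = 406762, 200
x_transform = np.empty((n_samples, n_components))  # allocate the result once
row = 0
for i in range(407):
    partial_x = load("...")  # same hypothetical chunk loader as above
    m = partial_x.shape[0]
    x_transform[row:row + m] = ipca.transform(partial_x)
    row += m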
To calculate the mean squared error of the reconstruction, you can use code such as the following:
from sklearn.metrics import mean_squared_error

total = 0.0
for i in xrange(407):
    # get_segment is a custom function returning the i-th chunk
    partial_x_reduced = get_segment(x_reduced, i)
    reconstructed_matrix = ipca.inverse_transform(partial_x_reduced)
    residual = mean_squared_error(get_segment(x, i), reconstructed_matrix)
    total += residual * len(partial_x_reduced)
mse = total / len(x_reduced)
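If keeping the full x around just for the error check is itself a memory problem, the same number can be accumulated chunk by chunk in a single pass, with no get_segment needed; a sketch, again assuming load yields the raw chunks in fitting order:

import numpy as np

total_sq_err = 0.0
n_values = 0
for i in range(407):
    partial_x = load("...")  # raw (unreduced) chunk
    reconstructed = ipca.inverse_transform(ipca.transform(partial_x))
    total_sq_err += np.sum((partial_x - reconstructed) ** 2)
    n_values += partial_x.size
mse = total_sq_err / n_values  # element-wise MSE over the whole dataset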
For the parameter tuning, you can set the number of components to the maximum value you want, transform your input, and then in your grid search only use the first k columns, k being your hyper-parameter. You don't have to recalculate the whole PCA each time you change k.
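A minimal sketch of that truncation trick, assuming the default whiten=False (so the inverse transform is a plain matrix product plus the mean) and a hypothetical helper reconstruction_mse_for_k:

import numpy as np

def reconstruction_mse_for_k(ipca, x_chunk, k):
    # keep only the first k columns of the full 200-component transform
    z = ipca.transform(x_chunk)[:, :k]
    # inverse transform restricted to the first k components
    x_hat = np.dot(z, ipca.components_[:k]) + ipca.mean_
    return np.mean((x_chunk - x_hat) ** 2)

for k in (10, 50, 100, 200):
    print("k=%d mse=%g" % (k, reconstruction_mse_for_k(ipca, partial_x, k)))

This works because the components are ordered by explained variance, so the first k columns of the 200-component transform are (up to the incremental approximation) what a k-component model would produce.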
Upvotes: 4