Reputation: 115
I'm currently trying to fit a Gaussian Process model to my data and have it predict some days ahead. I have reduced my ~10 features down to just 2 components via PCA in sklearn. So now I have PCA1 and PCA2. This was obtained by performing PCA on the training set (40%).
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(train_data)
PCAs = pca.transform(train_data)
PCA1 = PCAs[:,0]
PCA2 = PCAs[:,1]
where train_data is the dataframe with ~10 features and 50 rows, with StandardScaler() applied to it.
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import RBF

kernel = RBF()
model = gaussian_process.GaussianProcessRegressor(kernel=kernel, normalize_y=True, n_restarts_optimizer=10)
model.fit(x_days_train, PCA1)
y_pred, y_std = model.predict(x_days, return_std=True)
model.score(x_days_train, PCA1)
where x_days is the full 50 days, and x_days_train is the first 20 days (0, 1, 2, ...). I get a score of 1.0. However, my predicted results look terrible (as per below). After the training data, the prediction just falls and then stagnates.
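Not entirely sure what went wrong, but a couple of guesses:
1. Since my data has no target variables, I used PCA on all the features in the dataframe and they are supposed to be x variables? And then I used them as a y variable (by predicting). Maybe this is an incorrect approach?
2. Following that, can PCA even be used as y_prediction?
3. Am I supposed to apply PCA to not just the training data, but also to the test data (apply fit_transform)?
4. I seem to be only using PCA1 and not PCA2 (nor a combination of the two). Should I use both? If so, how?
Would appreciate any help, thank you.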
Upvotes: 3
Views: 637
Reputation: 36
Since my data has no target variables, I used PCA on all the features in the dataframe and they are supposed to be x variables? And then I used them as a y variable (by predicting). Maybe this is an incorrect approach?
You are correct. PCA is meant to transform high-dimensional data into far fewer dimensions. Essentially the data is compressed while retaining most of the variance across the samples. scikit-learn's transform() function does not accept a y variable. Instead, use the fit_transform() function, which accepts both variables: it fits and applies the projection to the x variable and ignores the y.
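As a small check of the point about fit_transform() ignoring y, the sketch below runs PCA with and without a y argument and compares the results. The data is synthetic and only stands in for the question's ~10 scaled features over 50 rows; the names X, y_dummy, scores_no_y and scores_with_y are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's data (assumption: 50 rows, 10 features).
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(50, 10)))
y_dummy = rng.normal(size=50)  # arbitrary values, only used to show y is ignored

# fit_transform returns ONE array of shape (n_samples, n_components).
scores_no_y = PCA(n_components=2).fit_transform(X)
scores_with_y = PCA(n_components=2).fit_transform(X, y_dummy)

print(scores_no_y.shape)
print(np.allclose(scores_no_y, scores_with_y))
```

Note that the return value is a single 2-D array with one column per component, so the components are obtained by indexing its columns.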
Following that, can PCA even be used as y_prediction?
PCA is only transforming the data, Gaussian Process Regression (GPR) is making predictions.
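On the symptom described in the question (a training score of 1.0, then a prediction that falls and flattens out after day 19), one likely contributor is how a GP with a stationary RBF kernel extrapolates: far from the training inputs the posterior mean returns to the prior mean, which with normalize_y=True is the mean of the training targets. The sketch below illustrates this on a synthetic series, not the question's data. The kernel length scale is fixed and the optimizer is disabled purely to keep the example reproducible; both are assumptions of this sketch.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Synthetic stand-in for PCA1 over time (assumption, not the question's data).
rng = np.random.default_rng(0)
x_days = np.arange(50, dtype=float).reshape(-1, 1)  # all 50 days
x_days_train = x_days[:20]                          # first 20 days
y_train = np.sin(x_days_train.ravel() / 3.0) + 0.1 * rng.normal(size=20)

# Fixed length scale, no hyperparameter optimisation (assumption for reproducibility).
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), optimizer=None,
                               normalize_y=True)
gpr.fit(x_days_train, y_train)

# The GP interpolates the training points, so the training score is ~1.0 ...
print(gpr.score(x_days_train, y_train))

# ... but far beyond day 19 the prediction flattens at the training mean.
y_pred, y_std = gpr.predict(x_days, return_std=True)
print(y_pred[-1], y_train.mean())
```

A training score of 1.0 therefore says little about forecast quality; scoring on held-out later days would be more informative.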
Am I supposed to apply PCA to not just the training data, but also to the test data (apply fit_transform)?
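Yes, but fit the PCA on the training data only and then call transform() (not fit_transform()) on the test data, so that both sets are projected onto the same components.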
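A minimal sketch of the usual pattern for this, assuming the first 20 of 50 rows are the training set as in the question: the scaler and the PCA are fitted on the training rows only, and the test rows are passed through transform() so they land on the same components. The arrays here are synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 50 days, 10 features (assumption).
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 10))
train, test = data[:20], data[20:]  # 40% train, as in the question

# Fit the scaler and the PCA on the training rows only.
scaler = StandardScaler().fit(train)
pca = PCA(n_components=2).fit(scaler.transform(train))

# Re-use the fitted objects for both sets; no refitting on the test rows.
train_pcs = pca.transform(scaler.transform(train))
test_pcs = pca.transform(scaler.transform(test))

print(train_pcs.shape, test_pcs.shape)
```

Calling fit_transform() on the test rows instead would compute a different set of components, so the train and test scores would no longer be comparable.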
I seem to be only using PCA1 and not PCA2 (nor a combination of the two). Should I use both? If so, how?
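After calling the fit_transform() method, which returns a single array with one column per component, split the columns like this:
pca_x, pca_y = pca.fit_transform(train_data).T
Then apply the data like this, reshaping pca_x into the 2-D input array that scikit-learn expects:
kernel = RBF()
model = gaussian_process.GaussianProcessRegressor(kernel=kernel, normalize_y=True, n_restarts_optimizer=10)
model.fit(pca_x.reshape(-1, 1), pca_y)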
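For reference, here is a self-contained sketch of the steps above on synthetic data. Since fit_transform() returns one array of shape (n_samples, 2), the sketch splits it into two 1-D arrays by transposing, and reshapes the input to 2-D before fitting. The synthetic data, the reduced n_restarts_optimizer, the random_state and the raised alpha (a precaution against ill-conditioning) are assumptions of this sketch, not part of the original snippet.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled dataframe (assumption: 50 rows, 10 features).
rng = np.random.default_rng(42)
train_data = StandardScaler().fit_transform(rng.normal(size=(50, 10)))

# One array comes back; transposing lets it unpack into the two components.
pca = PCA(n_components=2)
pca_x, pca_y = pca.fit_transform(train_data).T

# scikit-learn expects a 2-D input array, hence the reshape.
X_in = pca_x.reshape(-1, 1)
model = GaussianProcessRegressor(kernel=RBF(), normalize_y=True, alpha=1e-6,
                                 n_restarts_optimizer=2, random_state=0)
model.fit(X_in, pca_y)

y_pred, y_std = model.predict(X_in, return_std=True)
print(y_pred.shape, y_std.shape)
```

On random data like this the fit carries no real signal and scikit-learn may emit convergence warnings; the point is only the array shapes and the call sequence.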
Here are the scikit-learn user guides for PCA and GPR.
Upvotes: 1