I'm trying to train a linear regression model. With GridSearchCV I want to investigate how the model performs with different numbers of dimensions after PCA. I also found a sklearn tutorial which does pretty much the same thing.
But first, my code:
import pandas as pd
import sklearn.linear_model as skl_linear_model
import sklearn.pipeline as skl_pipeline
import sklearn.model_selection as skl_model_selection
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

model_lr = skl_linear_model.LinearRegression()
pca_lr = PCA()

# Standardize, reduce dimensionality, then fit the regression
pipeline = skl_pipeline.Pipeline([
    ('standardize', StandardScaler()),
    ('reduce_dim', pca_lr),
    ('regressor', model_lr)])

# Search over every possible number of principal components
n_components = list(range(1, len(X_train.columns) + 1))
param_grid_lr = {'reduce_dim__n_components': n_components}

estimator_lr = skl_model_selection.GridSearchCV(
    pipeline,
    param_grid_lr,
    scoring='neg_root_mean_squared_error',
    n_jobs=2,
    cv=skl_model_selection.KFold(n_splits=25, shuffle=False, random_state=None),
    error_score=0,
    verbose=1,
    refit=True)

estimator_lr.fit(X_train, y_train)
grid_results_lr = pd.DataFrame(estimator_lr.cv_results_)
By the way, my training data are measurements in different units, in the shape of an 8548x7 array. The code seems to work so far; these are the cv_results. For the complexity of the problem, the result is OK for linear regression (I'm also using other models, which perform better).
If I understand this correctly, the image shows that principal components 1 and 2 explain the main part of the data, since with those two alone the loss is already almost minimized. Adding more principal components doesn't really improve the result, so their contribution to the explained variance is probably rather low.
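For reference, the curve can also be read off numerically instead of from the plot alone; a minimal sketch using the grid_results_lr DataFrame from above:

import matplotlib.pyplot as plt

# Mean CV RMSE per number of components (scores are negated RMSE)
rmse = -grid_results_lr['mean_test_score']
plt.plot(grid_results_lr['param_reduce_dim__n_components'], rmse, marker='o')
plt.xlabel('n_components')
plt.ylabel('mean CV RMSE')
plt.show()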
To prove this, I manually did a PCA, and this is where the confusion kicks in:
X_train_scaled = StandardScaler().fit_transform(X_train)
pca = PCA()

# Column labels 'PC1' ... 'PC7'
PC_list = ['PC' + str(i) for i in range(1, len(X_train.columns) + 1)]

# Scores, loadings, and explained variance as DataFrames
PC_df = pd.DataFrame(data=pca.fit_transform(X_train_scaled), columns=PC_list)
PC_loadings_df = pd.DataFrame(pca.components_.T,
                              columns=PC_list,
                              index=X_train.columns.values.tolist())
PC_var_df = pd.DataFrame(data=pca.explained_variance_ratio_,
                         columns=['explained_var'],
                         index=PC_list)
That's the explained variance ratio.
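The same numbers can also be printed as a cumulative sum from the pca object fitted above:

import numpy as np

# Cumulative explained variance of the PCA fitted on the standardized data
print(np.cumsum(pca.explained_variance_ratio_))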
This seemed a little unexpected, so I checked the tutorial I mentioned at the beginning. And unless I'm overlooking something, they were doing pretty much the same thing, except for one detail: when fitting the PCA they did not scale the data, even though they used a StandardScaler in their pipeline. Anyway, the results they get look just fine.
So I tried the same, and without standardization the explained variance ratio looks like this. It seems like this would explain my cv_results much better, since PC 1 and 2 explain over 90% of the variance.
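For completeness, a minimal sketch of that comparison (the unscaled variant simply fits the PCA on the raw X_train):

# Explained variance ratio with and without standardization
pca_raw = PCA().fit(X_train)
pca_std = PCA().fit(StandardScaler().fit_transform(X_train))
print('raw:         ', pca_raw.explained_variance_ratio_)
print('standardized:', pca_std.explained_variance_ratio_)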
But I'm wondering why they didn't scale the data before the PCA. Every piece of information I find about PCA says the input needs to be standardized, and this makes sense, since my data are measurements in different units.
So what am I missing? Is my initial approach actually correct, and am I just misinterpreting the results? Is it possible that the first two principal components almost minimize the loss even though they explain only around 50% of the variance? Or could it even be that the PCA in the pipeline does not actually receive scaled data, which would explain why the CV results correlate better with the non-standardized manual PCA?
I did not check the correctness of the code; I only read the text and looked at the graphs, so I will assume your analysis is correct.
I will only attempt to address
But I'm wondering why they didn't scale the data before PCA
and I advise taking this with a grain of salt: I came to think about this same question a while back, and this is what I came up with. I have no reference for the following.
You should scale the data if:

- the features are measured in different units, or
- their relative scales are arbitrary and carry no meaning.

You should not scale the data if:

- the features already share a common, meaningful scale, or
- each channel is already normalized at its source (e.g., all features come from the same calibrated sensor).
It seems the last point is the case in the tutorial: the 8x8 digit images are really a 64-channel sensor, and each pixel is already normalized in the sensor (since the dataset is assumed to be clean, I believe).
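If the tutorial is built on sklearn's digits dataset (an assumption on my part), the effect is easy to check:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
# Variance captured by the first two components, raw vs. standardized pixels
print(PCA(n_components=2).fit(X).explained_variance_ratio_.sum())
print(PCA(n_components=2).fit(StandardScaler().fit_transform(X)).explained_variance_ratio_.sum())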
PCA won't work if:

- the structure you care about in the data is nonlinear.

It is not hard to find examples where PCA doesn't work; it is only a linear model, after all.
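A toy sketch (my own, no reference): points on a circle are intrinsically one-dimensional, but that dimension is the angle, which no linear projection can capture:

import numpy as np
from sklearn.decomposition import PCA

# Points on a circle: intrinsically 1-D, but the structure (the angle) is
# nonlinear, so PCA splits the variance roughly evenly over two components.
theta = np.linspace(0, 2 * np.pi, 200)
X = np.column_stack([np.cos(theta), np.sin(theta)])
print(PCA().fit(X).explained_variance_ratio_)  # roughly [0.5, 0.5]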
This doesn't say what you should do with your own 8548x7 data. Just by the shape, I am assuming you should normalize in that case.
I hope this gives some inspiration for further thinking.
Let me add a side note on not scaling images: multiple images can be seen as taken by different sensors, due to lighting, depth, or other effects that can change between images. In the case of 8x8 scans from a testing database, this is unlikely.