Confused about standardization before PCA

Question

I'm trying to train a linear regression model. With GridSearchCV I want to investigate how the model performs with different numbers of dimensions after PCA. I also found a sklearn tutorial which does pretty much the same thing.

But first, my Code:

import pandas as pd
import sklearn.linear_model as skl_linear_model
import sklearn.pipeline as skl_pipeline
import sklearn.model_selection as skl_model_selection
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

model_lr = skl_linear_model.LinearRegression()

pca_lr = PCA()

pipeline = skl_pipeline.Pipeline([
            ('standardize', StandardScaler()),
            ('reduce_dim', pca_lr), 
            ('regressor', model_lr)])

n_components = list(range(1, len(X_train.columns)+1))
param_grid_lr = {'reduce_dim__n_components': n_components}

estimator_lr = skl_model_selection.GridSearchCV(
                pipeline,
                param_grid_lr,
                scoring='neg_root_mean_squared_error',
                n_jobs=2,
                cv=skl_model_selection.KFold(n_splits=25, shuffle=False, random_state=None),
                error_score=0,
                verbose=1,
                refit=True)

estimator_lr.fit(X_train, y_train)
grid_results_lr = pd.DataFrame(estimator_lr.cv_results_)

By the way, my training data are measurements in different units in the shape of a 8548x7 array. The Code seems to work so far, those are the cv_results. For the complexity of the problem the result is ok for linear regression (I'm also using other models which perform better).

If I understand this correctly, the image shows, that Principal Component 1 and 2 should explain the main part of the data since with those two the loss is almost minimized. Adding more Principal Components doesn't really improve the result, so their contribution to explained variance is probably rather low.

To prove this, I manually did a PCA, and this is were confusion kicks in:

X_train_scaled = StandardScaler().fit_transform(X_train)

pca = PCA()

PC_list = []
for i in range(1,len(X_train.columns)+1): PC_list.append(''.join('PC'+str(i)))

PC_df = pd.DataFrame(data=pca.fit_transform(X_train_scaled), columns=PC_list)

PC_loadings_df = pd.DataFrame(pca.components_.T,
                            columns=PC_list,
                            index=X_train.columns.values.tolist())

PC_var_df = pd.DataFrame(data=pca.explained_variance_ratio_,
                         columns=['explained_var'],
                         index=PC_list)

That's the explained variance ratio.

This seemed a little unexpected, so I checked the tutorial I mentioned at the beginning. And if I don't overlook something, the person was doing pretty much the same, except one thing:

When fitting the PCA they did not scale the data, even though they used a StandardScaler in their pipeline. Anyway the results they are getting are looking just fine.

So I tried the same and without standardization the explained variance ratio looks like this. It seems like this would explain my cv_results way better since PC 1 and 2 explain over 90 % of the variance.

But I'm wondering why they didn't scale the data before PCA. Every info I find about PCA says that the input need to be standardized. And this makes sense, since the data I have are measurements in different units.

So what am I missing? Is my initial approach actually correct and I just misinterpret the results? Is it possible that the first two Principal Components almost minimize the loss, even though they explain only around 50 % of the variance? Or could it even be, that the PCA in the pipeline does not actually scale the data, which is why the results of the CV correlate better with the not-standardized manual PCA?

Confused about standardization before PCA

Answers (1)

When should you, or shouldn't you scale the data?

Related Questions