adamlowlife

Reputation: 239

Linear regression gives worse results after normalization or standardization

I'm performing linear regression on this dataset: archive.ics.uci.edu/ml/datasets/online+news+popularity

It contains various types of features - rates, binary, numbers etc.

I've tried scikit-learn's Normalizer, StandardScaler and PowerTransformer, but they've all given worse results than using no transformer at all.

I'm using them like this:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Features: everything except the URL and the target column
X = df.drop(columns=['url', 'shares'])
Y = df['shares']

# Fit the scaler on all features and transform them
transformer = StandardScaler().fit(X)
X_scaled = transformer.transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
perform_linear_and_ridge_regression(X=X_scaled, Y=Y)

The function on the last line, perform_linear_and_ridge_regression(), works correctly and uses GridSearchCV to determine the best hyperparameters.

For completeness, here is the function as well:

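from sklearn.linear_model import LinearRegression
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

def perform_linear_and_ridge_regression(X, Y):
    # Hold out 25% of the rows for testing
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=10)
    # Grid-search over whether to fit an intercept, with 5-fold CV
    lin_reg_parameters = { 'fit_intercept': [True, False] }
    lin_reg = GridSearchCV(LinearRegression(), lin_reg_parameters, cv=5)
    lin_reg.fit(X=X_train, y=Y_train)
    Y_pred = lin_reg.predict(X_test)
    # Note: "MAE" here is the median absolute error
    print('Linear regression MAE =', median_absolute_error(Y_test, Y_pred))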

The results are surprising, as every transformer gives a worse score than the original data:

Linear reg. on original data: MAE = 1620.510555135375

Linear reg. after using Normalizer: MAE = 1979.8525218964242

Linear reg. after using StandardScaler: MAE = 2915.024521207241

Linear reg. after using PowerTransformer: MAE = 1663.7148884463259

Is this just a special case where standardization doesn't help, or am I doing something wrong?

EDIT: Even when I leave the binary features out, most of the transformers give worse results.

Upvotes: 3

Views: 4396

Answers (1)

Ankish Bansal

Reputation: 1902

Your dataset has many categorical and ordinal features. You should handle those separately first. It also looks like you are applying normalization to the categorical variables as well, which is completely wrong.

Here is a nice link that explains how to handle categorical features for a regression problem.
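As a rough sketch of treating the two kinds of columns separately, something like the following scales only the continuous columns and passes the binary ones through untouched. The data is synthetic and the column names (n_tokens, rate_positive, is_weekend, channel_is_tech) are made up for illustration, not taken from the real dataset. The "only contains 0 and 1" rule for spotting binary columns is also just an assumption that may need adjusting for your data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data; all column names are hypothetical.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "n_tokens": rng.integers(100, 2000, size=200).astype(float),
    "rate_positive": rng.random(200),
    "is_weekend": rng.integers(0, 2, size=200),
    "channel_is_tech": rng.integers(0, 2, size=200),
})
y = (3.0 * X["rate_positive"] + 0.001 * X["n_tokens"]
     + X["is_weekend"] + rng.normal(0, 0.1, size=200))

# Assumption: a column is binary if it contains nothing but 0 and 1.
binary_cols = [c for c in X.columns if set(X[c].unique()) <= {0, 1}]
numeric_cols = [c for c in X.columns if c not in binary_cols]

# Scale only the continuous columns; binary columns are passed through as-is.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)

# Putting the scaler in a Pipeline means it is fitted only on the data
# that the pipeline itself is fitted on.
model = Pipeline([("preprocess", preprocess), ("reg", LinearRegression())])
model.fit(X, y)

# Transformed matrix: scaled continuous columns first, then the binary ones.
X_t = model.named_steps["preprocess"].transform(X)
```

A pipeline like this can also be passed to GridSearchCV in place of the bare LinearRegression(), so the scaler is refitted on each training fold rather than on the whole dataset.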

Upvotes: 1
