Reputation: 239
I'm performing linear regression on this dataset: archive.ics.uci.edu/ml/datasets/online+news+popularity
It contains various types of features: rates, binary indicators, numbers, etc.
I've tried using scikit-learn's Normalizer, StandardScaler and PowerTransformer, but they've all produced worse results than using no transformer at all.
I'm using them like this:
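import pandas as pd
from sklearn.preprocessing import StandardScaler

# Separate the features from the target; 'url' is an identifier, not a feature
X = df.drop(columns=['url', 'shares'])
Y = df['shares']

# Standardize every column, then restore the column names
transformer = StandardScaler().fit(X)
X_scaled = transformer.transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
perform_linear_and_ridge_regression(X=X_scaled, Y=Y)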
The function on the last line, perform_linear_and_ridge_regression(), is correct for sure and uses GridSearchCV to determine the best hyperparameters.
Just to make sure, I include the function as well:
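from sklearn.linear_model import LinearRegression
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

def perform_linear_and_ridge_regression(X, Y):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=10)
    lin_reg_parameters = {'fit_intercept': [True, False]}
    lin_reg = GridSearchCV(LinearRegression(), lin_reg_parameters, cv=5)
    lin_reg.fit(X=X_train, y=Y_train)
    Y_pred = lin_reg.predict(X_test)
    # "MAE" here is the median absolute error
    print('Linear regression MAE =', median_absolute_error(Y_test, Y_pred))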
The results are surprising, as every transformer makes the error worse:
Linear reg. on original data: MAE = 1620.510555135375
Linear reg. after using Normalizer: MAE = 1979.8525218964242
Linear reg. after using StandardScaler: MAE = 2915.024521207241
Linear reg. after using PowerTransformer: MAE = 1663.7148884463259
Is this just a special case where standardization doesn't help, or am I doing something wrong?
EDIT: Even when I leave the binary features out, most of the transformers give worse results.
Upvotes: 3
Views: 4396
Reputation: 1902
Your dataset has many categorical and ordinal features. You should handle those separately first. Also, it looks like you are applying normalization to the categorical variables too, which is wrong: scaling is meant for continuous features, not for binary or category-coded columns.
Here is a nice link that explains how to handle categorical features in a regression problem.
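To make the "handle them separately" advice concrete, here is a minimal sketch of one way to do it with ColumnTransformer. It is not tested on the actual dataset: the data below is synthetic, the column names are made up for illustration, and treating every column that contains only 0 and 1 as a binary indicator is an assumption you should check against the dataset description. Wrapping the preprocessing in a pipeline also means the scaler is fit on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data; all column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "n_tokens": rng.integers(100, 2000, size=200).astype(float),
    "rate_positive": rng.random(200),
    "is_weekend": rng.integers(0, 2, size=200),
    "channel_is_tech": rng.integers(0, 2, size=200),
})
df["shares"] = (3.0 * df["n_tokens"] + 500.0 * df["is_weekend"]
                + rng.normal(0, 50, size=200))

X = df.drop(columns=["shares"])
y = df["shares"]

# Assumption: a column holding only 0 and 1 is a binary indicator.
binary_cols = [c for c in X.columns if set(X[c].unique()) <= {0, 1}]
numeric_cols = [c for c in X.columns if c not in binary_cols]

# Scale only the continuous columns; pass the binary ones through untouched.
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)

# The pipeline fits the scaler on the training data only, then the regression.
model = make_pipeline(preprocess, LinearRegression())

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=10
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

On the real data you would build binary_cols from the dataset's documentation rather than guessing from the values, and you could pass the same pipeline to GridSearchCV instead of a bare LinearRegression.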
Upvotes: 1