Reputation: 1

SGDRegressor graph looking so different from LinearRegression

I'm using ggplot2's "economics" dataset to study linear regression. Why does my SGDRegressor graph look like this?

import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

# df holds the "economics" dataset as a pandas DataFrame
x = df.loc[:,'pce'].values.reshape(-1,1)
y = df.loc[:,'psavert'].values.reshape(-1,1)
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)

# Fit on the raw, unscaled feature
sr = SGDRegressor(eta0 = 0.001, verbose = 1)
sr.fit(x_train,y_train.flatten())

# Plot the training data and the fitted line
plt.scatter(x_train,y_train, s = 5, alpha = 0.3, c = 'blue')
plt.plot(x_train,sr.predict(x_train), c = 'green')
plt.xlabel('pce(billions)')
plt.ylabel('saving rate')
plt.show()

SGD graph

LinearRegression graph

I tried adjusting eta0 and max_iter, but neither helped.

Upvotes: 0

Answers (1)

MuhammedYunus

Reputation: 5010

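In general you need to scale the input features when using gradient descent. SGDRegressor uses stochastic gradient descent, whilst LinearRegression solves the least-squares problem directly and so is not sensitive to the scale of the features in the same way.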

You only have a single input feature, so I thought it would be okay to skip the scaling step. However, the large values of X seem to cause instability, and I found that you need to either divide X by 100 or use StandardScaler.
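To illustrate why large values of X can destabilise gradient descent, here is a minimal sketch using plain batch gradient descent in NumPy. It is not what SGDRegressor does internally (that uses per-sample updates and a decaying learning rate by default). The synthetic data, the `gd_fit` helper and the step sizes are all assumptions, chosen only to mimic the magnitude of pce.

```python
import numpy as np

def gd_fit(x, y, eta, steps):
    """Batch gradient descent on mean squared error for y ~ w*x + b."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        err = w * x + b - y              # residuals at the current w, b
        w -= eta * 2 * np.mean(err * x)  # gradient step for the slope
        b -= eta * 2 * np.mean(err)      # gradient step for the intercept
    return w, b

# Synthetic stand-in for the data (assumption): x in the hundreds to thousands
rng = np.random.default_rng(0)
x = rng.uniform(500, 12000, size=500)
y = 12 - 0.0008 * x + rng.normal(0, 1, size=500)

# Raw feature: the slope gradient scales with x**2, so each step overshoots
with np.errstate(over='ignore', invalid='ignore'):
    w_raw, b_raw = gd_fit(x, y, eta=0.001, steps=50)

# Standardized feature: zero mean and unit variance keep the steps well behaved
x_std = (x - x.mean()) / x.std()
w_std, b_std = gd_fit(x_std, y, eta=0.1, steps=200)

print('raw x:         ', w_raw, b_raw)
print('standardized x:', w_std, b_std)
```

On the raw feature the weights grow at every step instead of settling, whereas on the standardized feature the same loop converges to the least-squares line.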

In short, I replaced SGDRegressor() with a pipeline that first scales X before fitting the regressor:

sgd_reg = make_pipeline(
    StandardScaler(),
    SGDRegressor()
)

import pandas as pd
from matplotlib import pyplot as plt

#
# Load and visualise data
#

#Data from
# https://github.com/tidyverse/ggplot2/blob/main/data-raw/economics.csv
df = pd.read_csv('economics.csv')
print(df.head())  # in a Jupyter notebook, display(df) also works

x_name = 'pce'
y_name = 'psavert'

#Visualise data
f, ax = plt.subplots(figsize=(5, 3), layout='tight')

df.plot.scatter(
    x=x_name, y=y_name,
    s=10, edgecolor='none', alpha=0.5, c='dodgerblue',
    label='data', ax=ax
)

#
# Fit SGDRegressor
#
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x = df[[x_name]].to_numpy()
y = df[y_name].to_numpy()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

#Compose a pipeline that first standardizes X before fitting
sgd_reg = make_pipeline(
    StandardScaler(),
    SGDRegressor()
)
sgd_reg.fit(x_train, y_train)

#Plot results
ax.plot(x_train, sgd_reg.predict(x_train), c='tab:red', linewidth=1, label='linear fit')

ax.legend()
ax.set(xlabel='PCE (billions)', ylabel='saving rate', title='Fit using SGDRegressor')
ax.spines[['top', 'right']].set_visible(False)
plt.show()
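As a sanity check, the scaled pipeline can be compared against LinearRegression. The sketch below is a rough illustration rather than a definitive test: it uses synthetic data as a stand-in for economics.csv (an assumption, so that it runs without the file), and `random_state=0` is added only for repeatability.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for pce / psavert (assumption, not the real dataset)
rng = np.random.default_rng(0)
x = rng.uniform(500, 12000, size=(500, 1))
y = 12 - 0.0008 * x[:, 0] + rng.normal(0, 1, size=500)

# Closed-form least-squares fit on the raw feature
lin_reg = LinearRegression().fit(x, y)

# SGD fit on the standardized feature
sgd_reg = make_pipeline(StandardScaler(), SGDRegressor(random_state=0)).fit(x, y)

# Largest difference between the two fitted lines over the data
gap = np.max(np.abs(lin_reg.predict(x) - sgd_reg.predict(x)))
print(f'largest prediction gap: {gap:.3f}')
```

If the gap is small relative to the spread of y, the two models have found effectively the same line.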

Upvotes: 0
