Reputation: 1
Using ggplot2's "economics" dataset to study linear regression. Why does my SGDRegressor plot look like this?
x = df.loc[:,'pce'].values.reshape(-1,1)
y = df.loc[:,'psavert'].values.reshape(-1,1)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2)
from sklearn.linear_model import SGDRegressor
sr = SGDRegressor(eta0 = 0.001, verbose = 1)
sr.fit(x_train,y_train.flatten())
plt.scatter(x_train,y_train, s = 5, alpha = 0.3, c = 'blue')
plt.plot(x_train,sr.predict(x_train), c = 'green')
plt.xlabel('pce(billions)')
plt.ylabel('saving rate')
plt.show()
I tried adjusting eta0 and max_iter, but neither helped.
Upvotes: 0
Views: 23
Reputation: 5010
In general, you need to scale the input features when using gradient descent (SGDRegressor uses gradient descent, whilst LinearRegression does not).
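To illustrate the distinction: for a single feature, ordinary least squares has a closed-form solution, so there is no learning rate or iteration that could become unstable when the feature values are large. Below is a minimal sketch in plain Python. The numbers are made up for illustration (they are not the economics data), and `ols_1d` is a hypothetical helper; LinearRegression uses a general least-squares solver rather than this exact code.

```python
from statistics import fmean

def ols_1d(xs, ys):
    """Closed-form least squares for y ~ slope * x + intercept (one feature)."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data on the line y = 12 - 0.001 * x, with x in the thousands
xs = [1000.0, 2000.0, 3000.0, 4000.0]
ys = [11.0, 10.0, 9.0, 8.0]

slope, intercept = ols_1d(xs, ys)
print(slope, intercept)
```

The result is computed in one pass from sums over the data, so the magnitude of x does not cause the kind of divergence an iterative update can show.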
You only have a single input feature, so I initially thought it would be okay to skip the scaling step. However, the large values of X appear to make the fit unstable, and I found that you need to either divide X by 100 or use StandardScaler, as in the example below.
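One way to see where the instability can come from is a simplified SGD loop with a constant learning rate. This is only a sketch of the mechanism, not scikit-learn's implementation (SGDRegressor also applies a learning-rate schedule and a penalty by default). The data are synthetic and noise-free, with x values in the thousands as an assumed stand-in for pce; `sgd_fit` is a hypothetical helper.

```python
import math
import random
from statistics import fmean, pstdev

def sgd_fit(xs, ys, eta, epochs, seed=0):
    """Plain SGD on squared error for y ~ w * x + b, constant learning rate."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = (w * xs[i] + b) - ys[i]
            w -= eta * err * xs[i]  # gradient w.r.t. w scales with x
            b -= eta * err
    return w, b

# Synthetic data on the line y = 12 - 0.001 * x (assumed values, no noise)
xs = [500.0 + 230.0 * i for i in range(50)]
ys = [12.0 - 0.001 * x for x in xs]

# Unscaled: each update overshoots because eta * x**2 is far larger than 2
w_raw, b_raw = sgd_fit(xs, ys, eta=0.001, epochs=5)
print('unscaled weight is finite:', math.isfinite(w_raw))

# Standardised: subtract the mean and divide by the standard deviation first
mean_x, std_x = fmean(xs), pstdev(xs)
zs = [(x - mean_x) / std_x for x in xs]
w_std, b_std = sgd_fit(zs, ys, eta=0.01, epochs=200)

# Map the fitted parameters back to the original units of x
slope = w_std / std_x
intercept = b_std - slope * mean_x
print('slope:', slope, 'intercept:', intercept)
```

With the raw values, the weight update is proportional to x squared, so a step size that is fine for values near 1 overshoots badly for values in the thousands. After standardising, the same loop settles on the underlying line.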
In short, I replaced SGDRegressor() with a pipeline that first scales X before fitting the regressor:
sgd_reg = make_pipeline(
    StandardScaler(),
    SGDRegressor()
)
import pandas as pd
from matplotlib import pyplot as plt
#
# Load and visualise data
#
#Data from
# https://github.com/tidyverse/ggplot2/blob/main/data-raw/economics.csv
df = pd.read_csv('economics.csv')
display(df)  # available in Jupyter/IPython; use print(df) in a plain script
x_name = 'pce'
y_name = 'psavert'
#Visualise data
f, ax = plt.subplots(figsize=(5, 3), layout='tight')
df.plot.scatter(
    x=x_name, y=y_name,
    s=10, edgecolor='none', alpha=0.5, c='dodgerblue',
    label='data', ax=ax
)
#
# Fit SGDRegressor
#
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
x = df[[x_name]].to_numpy()
y = df[y_name].to_numpy()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
#Compose a pipeline that first standardizes X before fitting
sgd_reg = make_pipeline(
    StandardScaler(),
    SGDRegressor()
)
sgd_reg.fit(x_train, y_train)
#Plot results
ax.plot(x_train, sgd_reg.predict(x_train), c='tab:red', linewidth=1, label='linear fit')
ax.legend()
ax.set(xlabel='PCE (billions)', ylabel='saving rate', title='Fit using SGDRegressor')
ax.spines[['top', 'right']].set_visible(False)
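Because the regressor inside the pipeline is fitted on standardised X, its coef_ and intercept_ are expressed in scaled units. If you want the slope and intercept in the original units of the feature, they can be recovered from the scaler's mean_ and scale_. The sketch below uses synthetic stand-in data (assumed values, not the economics file) so that it runs on its own; the step names are the lowercase class names that make_pipeline assigns.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values in the thousands)
rng = np.random.default_rng(0)
x = rng.uniform(500, 12000, size=(200, 1))
y = 12.0 - 0.001 * x[:, 0] + rng.normal(0, 0.5, size=200)

model = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
model.fit(x, y)

scaler = model.named_steps['standardscaler']
sgd = model.named_steps['sgdregressor']

# The pipeline predicts coef * (x - mean) / scale + intercept_scaled,
# which rearranges to slope * x + intercept in the original units.
slope = sgd.coef_[0] / scaler.scale_[0]
intercept = sgd.intercept_[0] - slope * scaler.mean_[0]
print('slope:', slope, 'intercept:', intercept)
```

The same two lines should work on the fitted sgd_reg from the answer, assuming it was built with make_pipeline as shown.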
Upvotes: 0