Reputation: 19
I am trying to model X1^2 + X2^2 = Y using multiple regression in Python. My CSV file has two columns, X1 and X2, which contain random numbers between 1 and 60. I want to predict the Y values of the test data, but the error of my model is too high.
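import numpy as np
import pandas as pd
from sklearn import linear_model, metrics
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
df = pd.read_csv("C:/Users/Büşra/Desktop/bitirme1/square-test.csv",sep=';')
x = df[['X1','X2']]
y = df[['Y']]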
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,random_state=1)
x_train.shape, x_test.shape, y_train.shape, y_test.shape
model1 = linear_model.LinearRegression()
model1.fit(x_train, y_train)
print('Intercept: \n', model1.intercept_)
print('Coefficients: \n', model1.coef_)
print("Accuracy: %f" % model1.score(x_train,y_train))
y_pred = abs(model1.predict(x_test))
print('Mean Absolute Error:',(mean_absolute_error(y_test.to_numpy(), y_pred)))
print('Mean Squared Error:', (metrics.mean_squared_error(y_test.to_numpy(), y_pred)) )
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test.to_numpy(), y_pred)))
Mean Absolute Error: 297.7286734942946
Mean Squared Error: 129653.26345373654
Root Mean Squared Error: 360.0739694198076
Upvotes: 0
Views: 1200
Reputation: 3485
The predictive power of your model is exactly what I'd expect from a linear regression trained on random data as you describe.
Below I train an ordinary least squares linear regression on 10,000 pairs of random x1's and x2's, where 0 <= x <= 60 and y = x1**2 + x2**2. I then test it on 100 random pairs.
import numpy as np
import sklearn.linear_model
X_train = np.random.rand(20000).reshape(10000,2)*60
y_train = (X_train[:, 0]**2)+(X_train[:, 1]**2)
X_test = np.random.rand(200).reshape(100,2)*60
y_test = (X_test[:, 0]**2)+(X_test[:, 1]**2)
model = sklearn.linear_model.LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MAE: {}".format(np.abs(y_test-y_pred).mean()))
print("RMSE: {}".format(np.sqrt(((y_test-y_pred)**2).mean())))
It gives me almost the same errors as it gives you.
>>> python .\regression.py
MAE: 301.35977152696194
RMSE: 363.663670758086
Here is a plot illustrating why the regression cannot obtain better results than this. The features (x1 and x2) are on the x and y axes, and the target (y) is on the z-axis. The red dots are the training samples and the blue plane is the function that the regression produces.
A Linear Regression can only produce a function of the form y = w1·x1 + w2·x2 + w3, where w1, w2 and w3 are the weights being optimised by the regression. This type of function generates a flat plane, like the one shown. In this case the equation fit is y = -1249.41 + 61.18x1 + 60.69x2. This is clearly not the same type of function that generated the samples, which follow a nice curved surface.
The effect is much clearer if you run the code yourself so that you can move the 3D plot around and more easily see the shapes.
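To put a number on the mismatch without a plot, here is a minimal sketch (not part of the script above) that fits the same kind of plane with NumPy's least squares and then, purely as an illustration, fits a second model whose inputs are x1**2 and x2**2. The seed, the helper name `fit_rmse` and the squared-feature variant are my own assumptions. The point is only that the error comes from the shape of the model, not from the data or the training.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Same setup as above: uniform features in [0, 60), target x1^2 + x2^2.
X = rng.uniform(0, 60, size=(10_000, 2))
y = (X ** 2).sum(axis=1)


def fit_rmse(features, target):
    """Least-squares fit with an intercept; returns (weights, RMSE on the same data)."""
    A = np.column_stack([features, np.ones(len(features))])
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    rmse = np.sqrt(np.mean((A @ w - target) ** 2))
    return w, rmse


# Flat plane: y = w1*x1 + w2*x2 + w3
w_plane, rmse_plane = fit_rmse(X, y)

# Hypothetical variant: y = w1*x1^2 + w2*x2^2 + w3 (still linear in the weights)
w_sq, rmse_sq = fit_rmse(X ** 2, y)

print("plane weights:", w_plane, "RMSE:", rmse_plane)
print("squared-feature weights:", w_sq, "RMSE:", rmse_sq)
```

The plane should land near the errors reported above, while the squared-feature fit should recover weights of about 1, 1 and 0 with an error that is essentially zero.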
Upvotes: 2
Reputation: 51037
As I understand it, you are looking for a model of the form y = a*x_1 + b*x_2 + c to approximate the function y = x_1**2 + x_2**2 by linear regression. If your variables x_1 and x_2 are drawn uniformly at random from the range 0-60, the mean squared error over this range is exactly
$$\mathrm{MSE}(a, b, c) = \frac{1}{60^2} \int_0^{60} \int_0^{60} \left( a x_1 + b x_2 + c - x_1^2 - x_2^2 \right)^2 \, dx_1 \, dx_2$$
This is minimized when a = 60, b = 60 and c = -1200, so this is the best theoretically possible linear model, and your model should converge to it as it is trained on more data. This model has an MSE of 144,000 and an RMSE of 379.473. This roughly matches your model, so it looks like there is no problem with your results.
Your RMSE could be slightly lower than the "theoretically best RMSE" because it is measured over a sample rather than the whole uniform distribution. You should also get slightly different results for the range 1-60, if your data only contains integers, and so on.
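If you would rather check those numbers than take them on trust, here is a rough numerical sketch. It replaces the double integral with an average over a midpoint grid on [0, 60] x [0, 60], which is an approximation I am choosing for convenience, so expect small discretisation differences. The grid size `n` and the helper name `mse` are assumptions of this sketch.

```python
import numpy as np

# Midpoint grid over [0, 60] x [0, 60]; averaging over it approximates the
# double integral divided by the area 60**2.
n = 300
mid = (np.arange(n) + 0.5) * (60 / n)
x1, x2 = np.meshgrid(mid, mid)
x1, x2 = x1.ravel(), x2.ravel()
target = x1 ** 2 + x2 ** 2


def mse(a, b, c):
    """Approximate MSE of a*x1 + b*x2 + c against x1^2 + x2^2 on the grid."""
    return np.mean((a * x1 + b * x2 + c - target) ** 2)


# Best plane on the grid, found by ordinary least squares.
A = np.column_stack([x1, x2, np.ones_like(x1)])
(a, b, c), *_ = np.linalg.lstsq(A, target, rcond=None)

print("fitted a, b, c:", a, b, c)
print("MSE at (60, 60, -1200):", mse(60, 60, -1200))
print("RMSE at (60, 60, -1200):", np.sqrt(mse(60, 60, -1200)))
```

The fitted coefficients should come out very close to 60, 60 and -1200, and the MSE close to 144,000. Changing the grid to start at 1, or to integer points only, is an easy way to see how much the optimum shifts for those cases.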
Upvotes: 1