MasterEND

Reputation: 37

Linear Regression test data violating training data. Please explain where I went wrong

This is part of a dataset containing 1000 entries of house rental prices at different locations.

After training the model, if I send the same training data as test data, I get incorrect results. How is this even possible?

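from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Select the feature columns with a list; a set has no defined column order
X_loc = df[['area', 'rooms', 'location']]
y_loc = df['price']

X_train, X_test, y_train, y_test = train_test_split(X_loc, y_loc, test_size=1/3, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict the price for the first row of the training set
y_pred = regressor.predict(X_train[0:1])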

DATASET:

      price  rooms  area  location
0  0  22000      3  1339       140
1  1  45000      3  1580        72
3  3  72000      3  2310        72
4  4  40000      3  1800        41
5  5  35000      3  2100        57

The expected output (y_pred) should be 22000, but it is showing 29000. How can the model contradict an input it was already trained on?

Upvotes: 1

Views: 48

Answers (2)

What you observed is exactly what is referred to as the "training error". Machine learning models are meant to find the "best" fit, which minimizes the "total error" (i.e. the error over all data points together, not the error at each individual data point). 22000 is not very far from 29000, although it is not the exact number. This is because linear regression tries to compress all the variation in your data so that it follows one straight line.
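To make the idea of training error concrete, here is a minimal sketch that fits ordinary least squares to only the five rows shown in the question and then predicts those same rows. It uses NumPy's `np.linalg.lstsq` rather than scikit-learn, and it leaves out `rooms` because that column is 3 in every row shown and so cannot be told apart from the intercept. Both choices are assumptions made for illustration, and the numbers will not match a model trained on the full 1000-row dataset.

```python
import numpy as np

# The five rows shown in the question: (area, location) -> price.
# rooms is constant (3) in these rows, so it is omitted here (illustrative choice).
X = np.array([
    [1339, 140],
    [1580, 72],
    [2310, 72],
    [1800, 41],
    [2100, 57],
], dtype=float)
y = np.array([22000, 45000, 72000, 40000, 35000], dtype=float)

# Append a column of ones for the intercept and solve ordinary least squares.
A = np.column_stack([X, np.ones(len(X))])
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

# Predict the very same rows the model was fitted on.
train_pred = A @ coef
residuals = y - train_pred

for actual, predicted in zip(y, train_pred):
    print(f"actual={actual:8.0f}  predicted={predicted:8.0f}")

# Least squares minimizes this total over all rows, not each row's error.
print("sum of squared residuals:", float(np.sum(residuals ** 2)))
```

The predictions for the training rows differ from the actual prices because three coefficients cannot reproduce five points that do not lie on a common plane. The fit only guarantees that the total squared error is as small as possible.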

Upvotes: 1

ralf htp

Reputation: 9422

Possibly the underlying relationship is nonlinear, so applying a Linear Regression yields bad results. There are other reasons why a Linear Regression may fail, cf. https://stats.stackexchange.com/questions/393706/bad-linear-regression-results Nonlinear data often appears when there are (statistical) interactions between features.

A generalization of Linear Regression is the Generalized Linear Model (GLM), which can handle nonlinearities through its nonlinear link functions: https://en.wikipedia.org/wiki/Generalized_linear_model

In scikit-learn you can use a Support Vector Regression with a polynomial or RBF kernel for a nonlinear model: https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html
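As a rough sketch of that suggestion, the snippet below compares `LinearRegression` with `SVR(kernel="rbf")` on synthetic one-feature data that follows a sine curve. The data, the random seed, and the hyperparameters (`C=100`, `gamma=0.1`, `epsilon=0.1`) are assumptions chosen for illustration only, not values tuned for the rent dataset in the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# Synthetic data with a clearly nonlinear relationship (illustrative only).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# Fit a straight line and an RBF-kernel support vector regression.
linear = LinearRegression().fit(X, y)
svr_rbf = SVR(kernel="rbf", C=100, gamma=0.1, epsilon=0.1).fit(X, y)

# score() returns the coefficient of determination R^2 on the given data.
linear_r2 = linear.score(X, y)
svr_r2 = svr_rbf.score(X, y)
print(f"LinearRegression training R^2: {linear_r2:.3f}")
print(f"SVR (RBF kernel) training R^2: {svr_r2:.3f}")
```

SVR is sensitive to feature scale, so for columns such as area and location it would probably be wise to standardize the features first, for example with `StandardScaler` in a pipeline.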

An alternative approach is to analyze the data for interactions and apply the methods described in https://en.wikipedia.org/wiki/Generalized_linear_model#Correlated_or_clustered_data; however, this is complex. Possibly try Ridge Regression under this assumption, because it can handle multicollinearity, which is one form of statistical interaction: https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf
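To illustrate how Ridge Regression behaves under multicollinearity, here is a small sketch using the closed-form solution (X^T X + alpha * I)^-1 X^T y in NumPy. The two synthetic features, the noise levels, and `alpha=1.0` are assumptions for illustration, and `ridge_coefficients` is a hypothetical helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1: strong multicollinearity
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

# Center features and target so the intercept is not penalized.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefficients(Xc, yc, alpha):
    """Closed-form ridge solution; alpha=0 reduces to ordinary least squares."""
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

beta_ols = ridge_coefficients(Xc, yc, alpha=0.0)
beta_ridge = ridge_coefficients(Xc, yc, alpha=1.0)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

With nearly identical features, the unpenalized coefficients are poorly determined and can split the effect between the two columns almost arbitrarily. The penalty shrinks the coefficient vector toward a more stable split while the combined effect stays close to the true slope. In scikit-learn, `sklearn.linear_model.Ridge` offers the same kind of model with intercept handling built in.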

https://statisticsbyjim.com/regression/difference-between-linear-nonlinear-regression-models/

Upvotes: 1
