Linear Regression test data violating training data. Please explain where I went wrong

Question

This is part of a dataset containing 1000 entries of house rental prices at different locations.

After training the model, if I pass the same training data back in as test data, I get incorrect results. How is this even possible?

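from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Select feature columns with a list; a set has no guaranteed column order
X_loc = df[['area', 'rooms', 'location']]
y_loc = df['price']

X_train, X_test, y_train, y_test = train_test_split(X_loc, y_loc, test_size = 1/3, random_state = 0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict for the first row of the (shuffled) training split
y_pred = regressor.predict(X_train[0:1])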

DATASET:

       price  rooms  area  location
    0  22000      3  1339       140
    1  45000      3  1580        72
    3  72000      3  2310        72
    4  40000      3  1800        41
    5  35000      3  2100        57

The expected output (y_pred) should be 22000, but it's showing 29000. How can it violate the input it was already trained on?

Awadelrahman M. A. Ahmed · Accepted Answer

What you observed is exactly what is referred to as the "training error". Machine learning models are meant to find the "best" fit, the one that minimizes the total error over all data points together, not the error at each individual data point. 22000 is not very far from 29000, although it is not the exact number. This is because linear regression tries to compress all the variation in your data so that it follows one straight line (with several features, as here, a single flat plane).
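
The same effect can be reproduced in a minimal sketch. It assumes only the five rows shown in the question and, for simplicity, uses a single feature (area) instead of all three. The helper `fit_line` is a hypothetical name introduced here; it is ordinary least squares written out by hand, not the scikit-learn model from the question, so the numbers it prints will differ from the 29000 reported above.

```python
# Five (area, price) pairs copied from the dataset excerpt in the question.
areas = [1339, 1580, 2310, 1800, 2100]
prices = [22000, 45000, 72000, 40000, 35000]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(areas, prices)

# Predict on the very same points the line was fitted to.
predictions = [slope * x + intercept for x in areas]
residuals = [y - p for y, p in zip(prices, predictions)]

for x, y, p in zip(areas, prices, predictions):
    print(f"area={x}  actual={y}  predicted={p:.0f}")

# The individual errors are non-zero, but they cancel out overall:
# least squares with an intercept makes the residuals sum to (nearly) zero.
print("sum of residuals:", round(sum(residuals), 6))
```

Every point the line was fitted to still has its own non-zero error, because one line cannot pass through all five points at once. Only the combined squared error is minimized.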
