Very Large Values Predicted for Linear Regression

Question

I'm trying to run a linear regression in python to determine house prices given many features. Some of these are numeric and some are non-numeric. I'm attempting to do one hot encoding for the non-numeric columns and attach the new, numeric, columns to the old dataframe and drop the non-numeric columns. This is done on both the training data and test data.

I then took the intersection of the two columns features (since I had some encodings that were only located in the testing data). Afterwards, it goes into a linear regression. The code is the following:

non_numeric = list(set(list(train)) - set(list(train._get_numeric_data())))
train = pandas.concat([train, pandas.get_dummies(train[non_numeric])], axis=1)
train.drop(non_numeric, axis=1, inplace=True)

train = train._get_numeric_data()
train.fillna(0, inplace = True)

non_numeric = list(set(list(test)) - set(list(test._get_numeric_data())))
test = pandas.concat([test, pandas.get_dummies(test[non_numeric])], axis=1)
test.drop(non_numeric, axis=1, inplace=True)

test = test._get_numeric_data()
test.fillna(0, inplace = True)

feature_columns = list(set(train) & set(test))
#feature_columns.remove('SalePrice')
X = train[feature_columns]
y = train['SalePrice']

lm = LinearRegression(normalize = False)
lm.fit(X, y)

import numpy
predictions = numpy.absolute(lm.predict(test).round(decimals = 2))

The issue that I'm having is that I get these absurdly high Sale Prices as output, somewhere in the hundreds of millions of dollars. Before I tried one hot encoding I got reasonable numbers in the hundreds of thousands of dollars. I'm having trouble figuring out what changed.

Also, if there is a better way to do this I'd be eager to hear about it.

David · Accepted Answer

I posted this at the stats site and Ami Tavory pointed out that the get_dummies should be run on the merged train and test dataframe to ensure that the same dummy variables were set up in both dataframes. This solved the issue.

Very Large Values Predicted for Linear Regression

Answers (2)

Related Questions