David

Reputation: 1438

Very Large Values Predicted for Linear Regression

I'm trying to run a linear regression in Python to predict house prices from many features, some numeric and some non-numeric. I'm one-hot encoding the non-numeric columns, attaching the new numeric columns to the original dataframe, and dropping the non-numeric columns. This is done on both the training data and the test data.

I then took the intersection of the two sets of feature columns (since some encodings appeared only in the test data). Afterwards, the data goes into a linear regression. The code is as follows:

import pandas
import numpy
from sklearn.linear_model import LinearRegression

# One-hot encode the non-numeric columns of the training data
non_numeric = list(set(list(train)) - set(list(train._get_numeric_data())))
train = pandas.concat([train, pandas.get_dummies(train[non_numeric])], axis=1)
train.drop(non_numeric, axis=1, inplace=True)

train = train._get_numeric_data()
train.fillna(0, inplace=True)

# Repeat the encoding separately on the test data
non_numeric = list(set(list(test)) - set(list(test._get_numeric_data())))
test = pandas.concat([test, pandas.get_dummies(test[non_numeric])], axis=1)
test.drop(non_numeric, axis=1, inplace=True)

test = test._get_numeric_data()
test.fillna(0, inplace=True)

# Keep only the feature columns present in both dataframes
feature_columns = list(set(train) & set(test))
#feature_columns.remove('SalePrice')
X = train[feature_columns]
y = train['SalePrice']

lm = LinearRegression(normalize=False)
lm.fit(X, y)

predictions = numpy.absolute(lm.predict(test).round(decimals=2))

The issue that I'm having is that I get these absurdly high Sale Prices as output, somewhere in the hundreds of millions of dollars. Before I tried one hot encoding I got reasonable numbers in the hundreds of thousands of dollars. I'm having trouble figuring out what changed.

Also, if there is a better way to do this I'd be eager to hear about it.

Upvotes: 1

Views: 3772

Answers (2)

David

Reputation: 1438

I posted this at the stats site and Ami Tavory pointed out that get_dummies should be run on the merged train and test dataframes, so that the same dummy columns are created in both. This solved the issue.
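A rough sketch of that approach (not the exact code from the stats post; it assumes train and test are the original, un-encoded dataframes from the question and that SalePrice is the target):

import pandas
from sklearn.linear_model import LinearRegression

# Concatenate train and test so get_dummies sees every category once,
# then split back into the two frames using the outer index keys.
combined = pandas.concat([train.drop('SalePrice', axis=1), test], keys=['train', 'test'])
combined = pandas.get_dummies(combined)   # same dummy columns created for both frames
combined = combined.fillna(0)

X = combined.loc['train']
X_test = combined.loc['test']
y = train['SalePrice']

lm = LinearRegression()
lm.fit(X, y)
predictions = lm.predict(X_test).round(decimals=2)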

Upvotes: 1

Siva-Sg

Reputation: 2821

You seem to have run into collinearity introduced by the one-hot encoded categorical variables: for each encoded variable, the sum of its dummy columns is always 1, which duplicates the intercept column.

If you have one categorical variable, you need to set fit_intercept=False in your LinearRegression (or drop one of the dummy columns of the one-hot encoded variable).

If you have more than one categorical variable, you need to drop one dummy column for each of them to break the collinearity, as shown in the sketch below.
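A minimal sketch of the drop-one-column approach (drop_first is a standard pandas.get_dummies option; the column names Neighborhood and HouseStyle are made up for illustration, SalePrice is from the question):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame with two categorical variables (illustrative names)
df = pd.DataFrame({'Neighborhood': ['A', 'B', 'A', 'C'],
                   'HouseStyle':   ['1Story', '2Story', '1Story', '1Story'],
                   'SalePrice':    [200000, 250000, 210000, 180000]})

# drop_first=True drops one dummy per categorical variable, so the
# remaining dummies no longer sum to 1 and the intercept can stay.
X = pd.get_dummies(df.drop('SalePrice', axis=1), drop_first=True)
y = df['SalePrice']

reg = LinearRegression()
reg.fit(X, y)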

Here is a small demonstration with a single one-hot encoded variable:

from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd

df = pd.read_csv('/home/siva/anaconda3/data.csv')
df
#    C1  C2  C3     y
# 0   1   0   0  12.4
# 1   1   0   0  11.9
# 2   0   1   0   8.3
# 3   0   1   0   3.1
# 4   0   0   1   5.4
# 5   0   0   1   6.2

X = df.iloc[:, 0:3]
y = df.iloc[:, -1]

reg = LinearRegression()
reg.fit(X, y)
reg.coef_, reg.intercept_
# (array([ 4.26666667, -2.18333333, -2.08333333]), 7.8833333333333346)

We find that the coefficients for C1, C2 and C3 do not make sense given X.

reg1 = LinearRegression(fit_intercept=False)
reg1.fit(X, y)
reg1.coef_
# array([ 12.15,   5.7 ,   5.8 ])

The coefficients make much more sense when fit_intercept is set to False: each one is now simply the mean of y for its category (e.g. 12.15 = (12.4 + 11.9) / 2 for C1).

A detailed explanation for a similar question can be found here:

https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn

Upvotes: 2
