Reputation: 27
So I've made a Linear Regression model on this dataset: https://www.kaggle.com/shree1992/housedata
After cleaning the data and fitting the model, I get some extremely large coefficients that I did not expect.
I searched for this problem and, based on what I found, tried ridge regression. That did fix the extreme coefficients, but the score and MAE are almost the same (linear regression is actually slightly better on both), which suggests the problem is not overfitting. So why am I getting these huge coefficients, and how do I explain or interpret them? Thanks in advance. My coefficients and code are below.
Coefficients:
sqft_living :: -20531660933516.066
floors :: -46157.99116169465
bedrooms :: -35148.64994889144
yr_built :: -110.275390625
sqft_lot :: -0.01336842838432517
yr_renovated :: 13901.669921875
bathrooms :: 22068.444163259817
condition :: 28854.36132510344
view :: 54609.32181396632
waterfront :: 619987.8770517551
statezip_WA 98070 :: 51720518.26940918
statezip_WA 98023 :: 51733793.98413086
statezip_WA 98198 :: 51745527.19320679
statezip_WA 98092 :: 51753612.19506836
statezip_WA 98003 :: 51768969.80859375
statezip_WA 98057 :: 51774754.2020874
statezip_WA 98032 :: 51777293.54980469
statezip_WA 98188 :: 51780926.42871094
statezip_WA 98022 :: 51785464.6875
statezip_WA 98042 :: 51788032.485961914
statezip_WA 98001 :: 51798657.185058594
statezip_WA 98030 :: 51800982.91894531
statezip_WA 98002 :: 51807063.37084961
statezip_WA 98038 :: 51818086.75805664
statezip_WA 98058 :: 51818726.060058594
statezip_WA 98031 :: 51820966.17700195
statezip_WA 98055 :: 51836975.10852051
statezip_WA 98178 :: 51839662.78881836
statezip_WA 98059 :: 51845304.94116211
statezip_WA 98019 :: 51849298.035583496
statezip_WA 98065 :: 51858962.752441406
statezip_WA 98014 :: 51862571.193847656
statezip_WA 98148 :: 51872288.3659668
statezip_WA 98166 :: 51878712.109375
statezip_WA 98056 :: 51890492.997558594
statezip_WA 98045 :: 51890671.47558594
statezip_WA 98168 :: 51909556.58944702
statezip_WA 98146 :: 51923932.966308594
statezip_WA 98011 :: 51925708.75717163
statezip_WA 98028 :: 51930531.6730957
statezip_WA 98155 :: 51933038.31750488
statezip_WA 98024 :: 51933207.13555908
statezip_WA 98108 :: 51935337.22363281
statezip_WA 98077 :: 51937928.41999817
statezip_WA 98072 :: 51939094.63574219
statezip_WA 98106 :: 51946079.88293457
statezip_WA 98027 :: 51954189.55102539
statezip_WA 98133 :: 51968441.83276367
statezip_WA 98118 :: 51972078.98779297
statezip_WA 98074 :: 51972640.670410156
statezip_WA 98125 :: 51985392.0078125
statezip_WA 98034 :: 51989931.86279297
statezip_WA 98053 :: 51994949.201171875
statezip_WA 98075 :: 51996895.56713867
statezip_WA 98126 :: 52003476.768066406
statezip_WA 98008 :: 52019588.31152344
statezip_WA 98029 :: 52033227.60961914
statezip_WA 98177 :: 52044918.458618164
statezip_WA 98136 :: 52054739.052734375
statezip_WA 98052 :: 52055053.704589844
statezip_WA 98006 :: 52077050.865234375
statezip_WA 98007 :: 52084987.728515625
statezip_WA 98144 :: 52104137.84765625
statezip_WA 98116 :: 52123261.3046875
statezip_WA 98033 :: 52128846.232666016
statezip_WA 98115 :: 52137801.478027344
statezip_WA 98117 :: 52140383.259521484
statezip_WA 98005 :: 52147522.69140625
statezip_WA 98122 :: 52159159.841552734
statezip_WA 98103 :: 52160013.99584961
statezip_WA 98107 :: 52176913.24609375
statezip_WA 98199 :: 52218928.334228516
statezip_WA 98102 :: 52277970.43017578
statezip_WA 98040 :: 52319189.98120117
statezip_WA 98119 :: 52323874.4597168
statezip_WA 98105 :: 52360431.115722656
statezip_WA 98109 :: 52381532.43066406
statezip_WA 98112 :: 52410056.1015625
statezip_WA 98004 :: 52665837.48083496
statezip_WA 98039 :: 52891510.521728516
sqft_basement :: 20531660933682.504
sqft_above :: 20531660933785.93
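Code:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# `houses` is the DataFrame loaded from the Kaggle dataset linked above.
# Drop implausible rows and the columns that are not used as features.
houses_preprocessed = houses[
    (houses.price < 1.2 * 10**7) &
    (houses.bedrooms > 0) &
    (houses.bedrooms <= 6) &
    (houses.bathrooms > 0) &
    (houses.price > 8000)].drop(columns=['country', 'date', 'street', 'city'])
# Turn yr_renovated into a 0/1 "was renovated" flag.
houses_preprocessed.loc[houses_preprocessed['yr_renovated'] < 1, 'yr_renovated'] = 0
houses_preprocessed.loc[houses_preprocessed['yr_renovated'] > 1, 'yr_renovated'] = 1
# Keep only zip codes that occur more than 10 times (filter on the statezip column explicitly).
toremove = houses_preprocessed['statezip'].value_counts()
houses_preprocessed = houses_preprocessed[houses_preprocessed['statezip'].isin(toremove.index[toremove > 10])]
X = houses_preprocessed.drop(columns=['price'])
y = houses_preprocessed['price']
# One-hot encode the remaining categorical column (statezip).
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
reg = LinearRegression()
reg.fit(X_train, y_train)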
Upvotes: 0
Views: 543
Reputation: 46888
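What you encountered is multicollinearity. If two or more of your predictors are highly correlated, the regression model cannot tell their effects apart: many different coefficient combinations fit the data almost equally well, so the individual values can become arbitrarily large and are no longer interpretable. If you look at the correlations in the data: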
X = houses_preprocessed.drop(columns=['price'])
y = houses_preprocessed['price']
import seaborn as sns
sns.clustermap(X.select_dtypes("number").corr(method="spearman"),figsize=(6, 6))
These three variables are highly correlated:
sns.pairplot(X[['bathrooms','sqft_above','sqft_living']])
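Beyond high correlation, it can help to check whether one column is an exact linear combination of others. The coefficients in the question (about -2.05e13 for sqft_living and about +2.05e13 for both sqft_above and sqft_basement) would be consistent with sqft_living = sqft_above + sqft_basement, but that relation is an assumption here and has not been verified against the Kaggle file. A minimal sketch on synthetic data, where the helper name rank_deficiency is hypothetical:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing columns (NOT the Kaggle data).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "sqft_above": rng.integers(500, 3000, size=n).astype(float),
    "sqft_basement": rng.integers(0, 1500, size=n).astype(float),
})
# Assumed relation: living area is the exact sum of the other two columns.
demo["sqft_living"] = demo["sqft_above"] + demo["sqft_basement"]


def rank_deficiency(frame):
    """Return (number of columns) - (numerical rank); > 0 means exact collinearity."""
    values = frame.to_numpy(dtype=float)
    return values.shape[1] - int(np.linalg.matrix_rank(values))


deficit_full = rank_deficiency(demo)
deficit_reduced = rank_deficiency(demo.drop(columns=["sqft_living"]))
print(deficit_full, deficit_reduced)
```

On the real data, the same check could be applied to X.select_dtypes("number"). A nonzero result would mean that at least one numeric column is redundant, not merely correlated.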
So we keep only one of them. Lastly, because you one-hot encoded statezip, you cannot also fit an intercept: the statezip dummy columns sum to 1 in every row, which makes them perfectly collinear with the intercept column:
X = pd.get_dummies(X.drop(columns=['bathrooms','sqft_above']))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
reg = LinearRegression(fit_intercept=False)
reg.fit(X_train, y_train)
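The intercept issue can be seen on a tiny example. The sketch below uses made-up zip codes (not rows from the dataset) to show that a full set of dummy columns sums to 1 in every row, so adding a column of ones makes the design matrix rank deficient. It also shows an alternative that this answer does not use: drop_first=True in pd.get_dummies, which removes one reference category so that an intercept can be kept.

```python
import numpy as np
import pandas as pd

# Made-up categorical column for illustration only.
zips = pd.DataFrame(
    {"statezip": ["WA 98004", "WA 98039", "WA 98070", "WA 98004", "WA 98070", "WA 98039"]}
)

# Full one-hot encoding: one column per category.
full = pd.get_dummies(zips, dtype=float)
row_sums = full.sum(axis=1)  # every row sums to exactly 1

# Prepending an intercept column of ones makes the columns linearly dependent.
with_intercept = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank_full = int(np.linalg.matrix_rank(with_intercept))

# Dropping the first category removes the dependency while keeping the intercept.
reduced = pd.get_dummies(zips, drop_first=True, dtype=float)
with_intercept_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = int(np.linalg.matrix_rank(with_intercept_reduced))

print(with_intercept.shape[1], rank_full)            # more columns than rank
print(with_intercept_reduced.shape[1], rank_reduced) # columns equal rank
```

With drop_first=True, each remaining statezip coefficient is read relative to the dropped category, whereas with fit_intercept=False each one acts as that zip code's own baseline. Either choice removes the dependency.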
Check the R² on the test set:
reg.score(X_test,y_test)
0.7621069304476887
The coefficients now look reasonable given the range of your y values (sorted here by absolute value):
res = pd.DataFrame({'coef':reg.coef_},index=X.columns)
res.reindex(res.coef.abs().sort_values().index)
                            coef
sqft_lot               -0.023554
yr_built               54.699771
sqft_basement        -100.401752
sqft_living           278.836773
statezip_WA 98006     565.521930
...                          ...
statezip_WA 98023 -342256.082284
statezip_WA 98070 -353819.063160
statezip_WA 98004  589945.748620
waterfront         621313.209967
statezip_WA 98039  816056.566554
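This also suggests why the original model could score well despite its coefficients. When one column is an exact sum of two others, adding a constant to two coefficients and subtracting it from the third leaves every prediction unchanged, so the fit cannot pin down that constant. A minimal sketch with synthetic data, where the relation living = above + basement and all numeric values are illustrative assumptions:

```python
import numpy as np

# Synthetic columns (NOT the Kaggle data); living is assumed to be the exact sum.
rng = np.random.default_rng(1)
above = rng.uniform(500, 3000, size=50)
basement = rng.uniform(0, 1500, size=50)
living = above + basement
X = np.column_stack([living, above, basement])

# A "sane" coefficient vector (illustrative values) ...
sane = np.array([0.0, 270.0, 166.0])
# ... and an inflated one, shifted along the direction that cancels out.
shift = 1.0e6  # hypothetical constant; any value gives the same predictions
inflated = sane + shift * np.array([-1.0, 1.0, 1.0])

pred_sane = X @ sane
pred_inflated = X @ inflated
# Largest difference between the two sets of predictions (rounding error only).
max_gap = float(np.max(np.abs(pred_sane - pred_inflated)))
print(inflated, max_gap)
```

Both coefficient vectors describe the same model, which is why the score and MAE barely move when ridge regression or dropping a column brings the coefficients back to a readable scale.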
Upvotes: 1