Jleeca

Reputation: 29

Linear regression / OLS regression with Python

I want to run a multiple linear regression model. There are 5 independent variables (2 of them are categorical).

So I first applied OneHotEncoder to turn the categorical variables into dummies.
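
Roughly like this (a minimal sketch; I actually used OneHotEncoder, but pd.get_dummies with all levels kept gives the same dummy column layout):

import pandas as pd

# keep every level of both categoricals; this is what produces the five
# floorLevel_* and four buildingType_* dummy columns used below
df = pd.get_dummies(df, columns=['floorLevel', 'buildingType'], drop_first=False)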

These are the dependent and independent variables:

y = df['price']
x = df[['age', 'totalRooms', 'elevator',
        'floorLevel_bottom', 'floorLevel_high', 
        'floorLevel_low',
        'floorLevel_medium','floorLevel_top',
        'buildingType_bungalow', 'buildingType_plate', 
        'buildingType_plate_tower', 'buildingType_tower']]

Next, I tried the following two methods, but found that their results differ only in the intercept and the categorical variables.

from sklearn.linear_model import LinearRegression

mlr = LinearRegression()  # scikit-learn fits the intercept itself by default
mlr.fit(x, y)

print('Intercept: \n', mlr.intercept_)
print("Coefficients:")
list(zip(x, mlr.coef_))

This gives

Intercept: 35228.96453917408

Coefficients:
[('age', 1046.5347118942063),
 ('totalRooms', -797.7667275033103),
 ('elevator', 11940.629576736419),
 ('floorLevel_bottom', 1011.5929167549165),
 ('floorLevel_high', 157.60625500592502),
 ('floorLevel_low', 483.89164772666277),
 ('floorLevel_medium', 630.9547280568961),
 ('floorLevel_top', -2284.0455475443687),
 ('buildingType_bungalow', 31610.88176756009),
 ('buildingType_plate', -9649.087529585862),
 ('buildingType_plate_tower', -8813.187607409624),
 ('buildingType_tower', -13148.606630564624)]

import statsmodels.api as sm

x_in = sm.add_constant(x)  # statsmodels needs an explicit constant column
model = sm.OLS(y, x_in).fit()
print(model.summary())

but this gives


Intercept 2.43e+04
age 1046.5347
totalRooms -797.7667
elevator 1.194e+04
floorLevel_bottom 5870.7604
floorLevel_high 5016.7738
floorLevel_low 5343.0592
floorLevel_medium 5490.1223
floorLevel_top 2575.1220
buildingType_bungalow 3.768e+04
buildingType_plate -3575.1281
buildingType_plate_tower -2739.2282
buildingType_tower -7074.6472

Now I don't understand the difference between them ;(

Upvotes: 1

Views: 212

Answers (1)

Next Door Engineer

Reputation: 2886

A few things to take care of, assuming you have done the data preprocessing in exactly the same way for both fits. (Judging by the variable names, I suspect you may have done something else in between.)

  1. Set the random seed to the same number so that both runs start from the same random state.
  2. Avoid the dummy variable trap and use pd.get_dummies(x, columns=['floorLevel', 'buildingType'], drop_first=True) (see the sketch below).
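
A minimal sketch of point 2, assuming df still contains the raw floorLevel and buildingType columns (names are taken from the question; everything else is illustrative):

import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# drop one level of each categorical so the dummies are not collinear
# with the intercept (the dummy variable trap)
x = pd.get_dummies(df[['age', 'totalRooms', 'elevator', 'floorLevel', 'buildingType']],
                   columns=['floorLevel', 'buildingType'], drop_first=True, dtype=float)
y = df['price']

# scikit-learn fits the intercept separately from the coefficients
mlr = LinearRegression().fit(x, y)
print(mlr.intercept_)
print(dict(zip(x.columns, mlr.coef_)))

# statsmodels needs the intercept added as an explicit 'const' column
ols = sm.OLS(y, sm.add_constant(x)).fit()
print(ols.params)

# with the redundant dummy columns removed, the two fits should report the
# same intercept and the same coefficients for the remaining variables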

Upvotes: 0
