Reputation: 29
I want to fit a multiple linear regression model with 5 independent variables (2 of them categorical).
So I first applied OneHotEncoder to turn the categorical variables into dummies.
These are the dependent and independent variables:
y = df['price']
x = df[['age', 'totalRooms', 'elevator',
'floorLevel_bottom', 'floorLevel_high',
'floorLevel_low',
'floorLevel_medium','floorLevel_top',
'buildingType_bungalow', 'buildingType_plate',
'buildingType_plate_tower', 'buildingType_tower']]
Next, I tried the following two methods, and found that their results differ only in the intercept and the coefficients of the categorical variables.
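from sklearn.linear_model import LinearRegression
# Fit ordinary least squares; sklearn adds the intercept itself
mlr = LinearRegression()
mlr.fit(x, y)
print('Intercept: \n', mlr.intercept_)
print("Coefficients:")
# Pair each column name with its fitted coefficient
print(list(zip(x, mlr.coef_)))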
This gives
Intercept: 35228.96453917408
Coefficients: [('age', 1046.5347118942063), ('totalRooms', -797.7667275033103), ('elevator', 11940.629576736419), ('floorLevel_bottom', 1011.5929167549165), ('floorLevel_high', 157.60625500592502), ('floorLevel_low', 483.89164772666277), ('floorLevel_medium', 630.9547280568961), ('floorLevel_top', -2284.0455475443687), ('buildingType_bungalow', 31610.88176756009), ('buildingType_plate', -9649.087529585862), ('buildingType_plate_tower', -8813.187607409624), ('buildingType_tower', -13148.606630564624)]
import statsmodels.api as sm
# statsmodels does not add an intercept automatically, so add a constant column
x_in = sm.add_constant(x)
model = sm.OLS(y, x_in).fit()
print(model.summary())
but this gives
Intercept 2.43e+04
age 1046.5347
totalRooms -797.7667
elevator 1.194e+04
floorLevel_bottom 5870.7604
floorLevel_high 5016.7738
floorLevel_low 5343.0592
floorLevel_medium 5490.1223
floorLevel_top 2575.1220
buildingType_bungalow 3.768e+04
buildingType_plate -3575.1281
buildingType_plate_tower -2739.2282
buildingType_tower -7074.6472
Now I don't understand the difference between them ;(
Upvotes: 1
Views: 212
Reputation: 2886
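A few things to check, assuming the preprocessing was identical for both fits. (Judging by the variable names, the two fits may not have used exactly the same data.) More importantly, you kept a dummy column for every level of floorLevel and buildingType and also fit an intercept. Each group of dummies sums to 1, which is exactly the constant column, so the design matrix is perfectly collinear and the intercept and dummy coefficients are not uniquely determined. The two libraries simply return different, equally valid solutions: in your output every floorLevel_* coefficient is shifted by the same amount (about 4859), every buildingType_* coefficient by about 6074, and the intercept moves the other way by roughly the sum of the two, so the predictions are the same. Dropping one level per categorical variable removes the ambiguity: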
pd.get_dummies(x, columns=['floorLevel', 'buildingType'], drop_first=True)
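To see what drop_first=True changes, here is a minimal sketch on made-up data. The toy DataFrame, its values and the helper design_matrix are hypothetical; only the column names follow the question, and get_dummies is applied to the raw (not yet encoded) categorical columns. It compares the number of columns in the design matrix (constant included) with its rank.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the question.
toy = pd.DataFrame({
    "age": [3, 10, 7, 15, 1, 8],
    "floorLevel": ["low", "high", "top", "low", "high", "top"],
    "buildingType": ["tower", "plate", "plate", "tower", "tower", "plate"],
})

def design_matrix(frame, drop_first):
    """One-hot encode the categoricals and prepend a constant column."""
    encoded = pd.get_dummies(
        frame, columns=["floorLevel", "buildingType"], drop_first=drop_first
    ).astype(float)
    encoded.insert(0, "const", 1.0)
    return encoded

full = design_matrix(toy, drop_first=False)    # every level kept
reduced = design_matrix(toy, drop_first=True)  # one level dropped per variable

full_rank = np.linalg.matrix_rank(full.to_numpy())
reduced_rank = np.linalg.matrix_rank(reduced.to_numpy())

print("all levels:  columns =", full.shape[1], "rank =", full_rank)
print("drop_first:  columns =", reduced.shape[1], "rank =", reduced_rank)
```

With every level kept, the matrix has 7 columns but only rank 5, so infinitely many coefficient vectors give the same fit. With drop_first=True it has 5 columns and rank 5, so the least-squares solution is unique and any correct OLS routine should report the same coefficients. Each remaining dummy coefficient is then read relative to the dropped (baseline) level.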
Upvotes: 0