Reputation: 431
I have to create a regression model in Python of energy ratings vs. price and see whether energy ratings depend on price or not.
Here is the data set and code:
import matplotlib.pyplot as plt  # needed for the plotting code further down
import statsmodels.formula.api as smf
# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('price ~ energyrating', data=df)
model = model.fit()
One of the parameters I am getting is negative. Maybe that is the reason for the bad graph, but I am not sure how to improve it.
model.params
#price=2.004943e+06 + (-3.913381e+05)*energyrating
Intercept 2.004943e+06
energyrating -3.913381e+05
dtype: float64
Then I tried to plot the final model, which was unsuccessful:
# Predict values
pred = model.predict()
# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(df['energyrating'], df['price'], 'o') # scatter plot showing actual data
plt.plot(df['energyrating'], pred, 'r', linewidth=2) # regression line
plt.xlabel('Energy ratings')
plt.ylabel('Price')
plt.title('Energy ratings Vs. Price')
plt.show()
How do I improve this? Is the data unstable, or is there a logical error I am missing?
Thanks in advance
EDIT:
Frequency graph of energy rating
This is how the energy rating is varying.
Upvotes: 1
Views: 91
Reputation: 894
I guess a simple linear regression cannot capture the relationship between price and energyrating from the plot you gave, since price doesn't monotonically decrease or increase when energyrating increases. I suggest you include a quadratic term of energyrating, i.e., adding a new column of energyrating * energyrating, or other higher-order transformations you consider reasonable.
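A minimal sketch of the quadratic-term idea, assuming the same column names as in the question. The data here is synthetic (an exact parabola made up for illustration, not your real df), and the names energyrating_sq, quad_model, and grid are hypothetical helpers, not anything from your code:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the asker's df (made-up values, not the real data):
# price follows an exact parabola in energyrating, peaking at rating 4.
ratings = np.repeat(np.arange(1, 8), 5)
df = pd.DataFrame({"energyrating": ratings})
df["price"] = 1_000_000 - 50_000 * (df["energyrating"] - 4) ** 2

# Add the squared column, then regress price on both the linear and squared terms.
df["energyrating_sq"] = df["energyrating"] * df["energyrating"]
quad_model = smf.ols("price ~ energyrating + energyrating_sq", data=df).fit()

# Predict on a sorted grid of ratings so the fitted curve can be drawn left to right.
grid = pd.DataFrame({"energyrating": np.arange(1, 8)})
grid["energyrating_sq"] = grid["energyrating"] ** 2
grid["pred"] = quad_model.predict(grid)

print(quad_model.params)
```

To draw the curve, grid["energyrating"] and grid["pred"] can be passed to plt.plot in place of the straight-line predictions. Whether a quadratic is actually a good fit for your data is something only the real plot and the model summary can tell you.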
If you are allowed to use models other than linear regression, I suggest you just average the price over each energyrating bin (it is discrete from your plot) and plot the curve, which I think would be nicer.
For example in pandas:
avg = df.groupby("energyrating")['price'].mean()
avg.plot()
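Since your frequency graph suggests the ratings are not evenly represented, it may also help to look at how many rows sit behind each average. A small sketch with made-up values (the df below is a hypothetical stand-in, and summary is just an illustrative name):

```python
import pandas as pd

# Made-up example values standing in for the asker's df.
df = pd.DataFrame({
    "energyrating": [1, 1, 2, 2, 2, 3],
    "price": [100.0, 300.0, 150.0, 250.0, 200.0, 400.0],
})

# Mean price per rating, plus the number of rows each mean is based on.
summary = df.groupby("energyrating")["price"].agg(["mean", "count"])
print(summary)
```

An average based on only a handful of rows is much noisier than one based on hundreds, so the count column gives a rough idea of how much to trust each point on the averaged curve.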
Upvotes: 1