Reputation: 16060
I am running this code to see how a linear regression model works in Python:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
train = pd.read_csv('data/train.csv', parse_dates=[0])
test = pd.read_csv('data/test.csv', parse_dates=[0])
print train.head()
#Feature engineering
temp_train = pd.DatetimeIndex(train['datetime'])
train['year'] = temp_train.year
train['month'] = temp_train.month
train['hour'] = temp_train.hour
train['weekday'] = temp_train.weekday
temp_test = pd.DatetimeIndex(test['datetime'])
test['year'] = temp_test.year
test['month'] = temp_test.month
test['hour'] = temp_test.hour
test['weekday'] = temp_test.weekday
#Define features vector
features = ['season', 'holiday', 'workingday', 'weather',
'temp', 'atemp', 'humidity', 'windspeed', 'year',
'month', 'weekday', 'hour']
#The evaluation metric is the RMSE in the log domain,
#so we should transform the target columns into log domain as well.
for col in ['casual', 'registered', 'count']:
    train['log-' + col] = train[col].apply(lambda x: np.log1p(x))
#Split train data set into training and validation sets
training, validation = train[:int(0.8*len(train))], train[int(0.8*len(train)):]
# Create a linear model
X = sm.add_constant(training[features])
model = sm.OLS(training['log-count'],X) # OLS stands for Ordinary Least Squares
f = model.fit()
ypred = f.predict(sm.add_constant(validation[features]))
print(ypred)
plt.figure();
plt.plot(validation[features], ypred, 'o', validation[features], validation['log-count'], 'b-');
plt.title('blue: true, red: OLS');
The following error message pops up. What does it mean, and how can I fix it?
Traceback (most recent call last):
File "C:/TestModel/linear_regression.py", line 99, in <module>
ypred = f.predict(sm.add_constant(validation[features]))
File "C:\Python27\lib\site-packages\statsmodels\base\model.py", line 749, in predict
return self.model.predict(self.params, exog, *args, **kwargs)
File "C:\Python27\lib\site-packages\statsmodels\regression\linear_model.py", line 359, in predict
return np.dot(exog, params)
ValueError: shapes (2178,12) and (13,) not aligned: 12 (dim 1) != 13 (dim 0)
This is the data sample:
print training.head()
datetime season holiday workingday weather temp atemp \
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395
humidity windspeed casual registered count year month hour weekday \
0 81 0 3 13 16 2011 1 0 5
1 80 0 8 32 40 2011 1 1 5
2 80 0 5 27 32 2011 1 2 5
3 75 0 3 10 13 2011 1 3 5
4 75 0 0 1 1 2011 1 4 5
log-casual log-registered log-count
0 1.386294 2.639057 2.833213
1 2.197225 3.496508 3.713572
2 1.791759 3.332205 3.496508
3 1.386294 2.397895 2.639057
4 0.000000 0.693147 0.693147
print validation.head()
datetime season holiday workingday weather temp atemp \
8708 2012-08-05 05:00:00 3 0 0 1 29.52 34.850
8709 2012-08-05 06:00:00 3 0 0 1 29.52 34.850
8710 2012-08-05 07:00:00 3 0 0 1 30.34 35.605
8711 2012-08-05 08:00:00 3 0 0 1 31.16 36.365
8712 2012-08-05 09:00:00 3 0 0 1 32.80 38.635
humidity windspeed casual registered count year month hour \
8708 74 16.9979 1 18 19 2012 8 5
8709 79 16.9979 7 12 19 2012 8 6
8710 74 19.9995 18 50 68 2012 8 7
8711 66 22.0028 27 81 108 2012 8 8
8712 59 23.9994 61 168 229 2012 8 9
weekday log-casual log-registered log-count
8708 6 0.693147 2.944439 2.995732
8709 6 2.079442 2.564949 2.995732
8710 6 2.944439 3.931826 4.234107
8711 6 3.332205 4.406719 4.691348
8712 6 4.127134 5.129899 5.438079
Upvotes: 1
Views: 2819
Reputation: 22897
This looks like a design problem in the add_constant function for this use case.
From the docstring:
"For ndarrays and pandas.DataFrames, checks to make sure a constant is not already included. If there is at least one column of ones then the original object is returned."
http://statsmodels.sourceforge.net/devel/_modules/statsmodels/tools/tools.html#add_constant
I think it was defined this way to avoid singular design matrices during estimation, but predict also works with singular matrices.
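My guess is that your validation data has one column in which all values are identical; for example, the rows could all be from the same year. In that case add_constant would return the data unchanged, so predict receives 12 columns while the fitted model has 13 parameters, which is exactly the shape mismatch in the traceback.
If this is intentional, then you need to add the constant manually to the DataFrame.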
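A minimal sketch of both steps, using only pandas: first check which feature columns hold a single distinct value, then insert the constant column by hand. The tiny `validation` frame here is made up for illustration (three rows, three features), and the column name `const` is an assumption chosen to match what add_constant normally produces.

```python
import pandas as pd

# Toy stand-in for the validation split (hypothetical values):
# every row is from 2012, so 'year' is a constant column.
validation = pd.DataFrame({
    'year': [2012, 2012, 2012],
    'hour': [5, 6, 7],
    'temp': [29.52, 29.52, 30.34],
})
features = ['year', 'hour', 'temp']

# Diagnostic: list the feature columns that contain only one distinct value.
n_unique = validation[features].nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)

# Manual fix: add the intercept column explicitly instead of
# relying on add_constant, so the column count matches the fitted model.
X_val = validation[features].copy()
X_val.insert(0, 'const', 1.0)
print(X_val.columns.tolist())
```

With the real data you would pass `X_val` to `f.predict(...)` in place of `sm.add_constant(validation[features])`. The constant goes first because add_constant prepends it by default for the training matrix, and the column order has to match the order of the fitted parameters.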
It would be better if add_constant had an option to turn off this behavior.
Upvotes: 2