Klausos Klausos

Reputation: 16060

ValueError when fitting a model

I am running this code just to check how a linear regression model works in Python:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt  # needed for the plotting calls at the end

train = pd.read_csv('data/train.csv', parse_dates=[0])
test = pd.read_csv('data/test.csv', parse_dates=[0])

print train.head()

#Feature engineering
temp_train = pd.DatetimeIndex(train['datetime'])
train['year'] = temp_train.year
train['month'] = temp_train.month
train['hour'] = temp_train.hour
train['weekday'] = temp_train.weekday

temp_test = pd.DatetimeIndex(test['datetime'])
test['year'] = temp_test.year
test['month'] = temp_test.month
test['hour'] = temp_test.hour
test['weekday'] = temp_test.weekday

#Define features vector
features = ['season', 'holiday', 'workingday', 'weather',
            'temp', 'atemp', 'humidity', 'windspeed', 'year',
            'month', 'weekday', 'hour']

#The evaluation metric is the RMSE in the log domain,
#so we should transform the target columns into log domain as well.
for col in ['casual', 'registered', 'count']:
    train['log-' + col] = train[col].apply(lambda x: np.log1p(x))

#Split train data set into training and validation sets
training, validation = train[:int(0.8*len(train))], train[int(0.8*len(train)):]

# Create a linear model
X = sm.add_constant(training[features])
model = sm.OLS(training['log-count'],X) # OLS stands for Ordinary Least Squares
f = model.fit()

ypred = f.predict(sm.add_constant(validation[features]))
print(ypred)

plt.figure();
plt.plot(validation[features], ypred, 'o', validation[features], validation['log-count'], 'b-');
plt.title('blue: true,   red: OLS');

The following error message pops up. What does it mean, and how can I fix it?

Traceback (most recent call last):
  File "C:/TestModel/linear_regression.py", line 99, in <module>
    ypred = f.predict(sm.add_constant(validation[features]))
  File "C:\Python27\lib\site-packages\statsmodels\base\model.py", line 749, in predict
    return self.model.predict(self.params, exog, *args, **kwargs)
  File "C:\Python27\lib\site-packages\statsmodels\regression\linear_model.py", line 359, in predict
    return np.dot(exog, params)
ValueError: shapes (2178,12) and (13,) not aligned: 12 (dim 1) != 13 (dim 0)

This is the data sample:

print training.head()
             datetime  season  holiday  workingday  weather  temp   atemp  \
0 2011-01-01 00:00:00       1        0           0        1  9.84  14.395   
1 2011-01-01 01:00:00       1        0           0        1  9.02  13.635   
2 2011-01-01 02:00:00       1        0           0        1  9.02  13.635   
3 2011-01-01 03:00:00       1        0           0        1  9.84  14.395   
4 2011-01-01 04:00:00       1        0           0        1  9.84  14.395   

   humidity  windspeed  casual  registered  count  year  month  hour  weekday  \
0        81          0       3          13     16  2011      1     0        5   
1        80          0       8          32     40  2011      1     1        5   
2        80          0       5          27     32  2011      1     2        5   
3        75          0       3          10     13  2011      1     3        5   
4        75          0       0           1      1  2011      1     4        5   

   log-casual  log-registered  log-count  
0    1.386294        2.639057   2.833213  
1    2.197225        3.496508   3.713572  
2    1.791759        3.332205   3.496508  
3    1.386294        2.397895   2.639057  
4    0.000000        0.693147   0.693147  


print validation.head()
                datetime  season  holiday  workingday  weather   temp   atemp  \
8708 2012-08-05 05:00:00       3        0           0        1  29.52  34.850   
8709 2012-08-05 06:00:00       3        0           0        1  29.52  34.850   
8710 2012-08-05 07:00:00       3        0           0        1  30.34  35.605   
8711 2012-08-05 08:00:00       3        0           0        1  31.16  36.365   
8712 2012-08-05 09:00:00       3        0           0        1  32.80  38.635   

      humidity  windspeed  casual  registered  count  year  month  hour  \
8708        74    16.9979       1          18     19  2012      8     5   
8709        79    16.9979       7          12     19  2012      8     6   
8710        74    19.9995      18          50     68  2012      8     7   
8711        66    22.0028      27          81    108  2012      8     8   
8712        59    23.9994      61         168    229  2012      8     9   

      weekday  log-casual  log-registered  log-count  
8708        6    0.693147        2.944439   2.995732  
8709        6    2.079442        2.564949   2.995732  
8710        6    2.944439        3.931826   4.234107  
8711        6    3.332205        4.406719   4.691348  
8712        6    4.127134        5.129899   5.438079  

Upvotes: 1

Views: 2819

Answers (1)

Josef

Reputation: 22897

This looks like a design limitation of the add_constant function for this use case.

From the docstring:

"For ndarrays and pandas.DataFrames, checks to make sure a constant is not already included. If there is at least one column of ones then the original object is returned."

http://statsmodels.sourceforge.net/devel/_modules/statsmodels/tools/tools.html#add_constant

I think it was defined this way to avoid singular design matrices during estimation, but predict also works with singular matrices.

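My guess is that your validation data has a column whose values are all identical; for example, the rows could all be from the same year. In that case add_constant treats the column as an existing constant and returns the data unchanged, so predict receives 12 columns while the fitted model has 13 parameters, which is exactly the shape mismatch in the error. If the constant column is intentional, you need to add the constant to the DataFrame manually.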
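As a rough sketch of that manual fix, the snippet below uses a small made-up DataFrame in place of validation[features]. The names validation_X, constant_cols and params are hypothetical and the parameter values are invented; only the shapes matter. The intercept is placed first here on the assumption that the training design matrix also has it first, so check the column order of your own X before copying this.

```python
import numpy as np
import pandas as pd

# Toy stand-in for validation[features]: 'year' has a single value,
# as it would if every validation row came from the same year.
validation_X = pd.DataFrame({
    "temp": [29.52, 30.34, 31.16],
    "year": [2012, 2012, 2012],
})

# Diagnose: list the columns that hold only one distinct value.
constant_cols = [c for c in validation_X.columns
                 if validation_X[c].nunique() == 1]
print(constant_cols)

# Add the intercept column by hand instead of relying on add_constant.
# Its position must match the training design matrix (assumed first here).
validation_X_const = validation_X.copy()
validation_X_const.insert(0, "const", 1.0)

# Hypothetical fitted parameters in the order (const, temp, year).
params = np.array([0.5, 0.1, 0.001])

# The shapes now align: (3, 3) dot (3,) gives one prediction per row.
ypred = np.dot(validation_X_const.values, params)
print(ypred.shape)
```

With the real data, the same idea would be applied to validation[features] before calling f.predict.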

It would be better if add_constant had an option to turn off this behavior.

Upvotes: 2
