Trisa Biswas

Reputation: 593

Shape not aligned error in OLS Regression python

I have a dataframe on which I am trying to run a statsmodels.api OLS regression. The fit works and prints the summary, but when I call predict() I get this error:

shapes (75,7) and (6,) not aligned: 7 (dim 1) != 6 (dim 0)

My code is:

X = newdf.loc[:, newdf.columns != 'V-9'].values
y = newdf.iloc[:,3].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
import statsmodels.api as sm
model = sm.OLS(y_train,X_train[:,[0,1,2,3,4,6]])
result = model.fit()
print(result.summary())

The error occurs when I run this:

y_pred = result.predict(X_test)

Shape of my X_train is (297,7)
Shape of my X_test is (75,7)
Both are of type numpy.ndarray

This question has been asked before. I have followed some posts on stackoverflow.com and tried to solve it using the reshape function, but that didn't help. Can anyone explain why I am getting this error and what the solution is?

Upvotes: 13

Views: 24316

Answers (2)

KBurchfiel

Reputation: 876

In my case, this error occurred because my original logistic regression had a constant (added using exog = sm.add_constant(exog)), but my test dataset didn't have this same constant. This caused a shape mismatch similar to the one that you had. I resolved the issue by adding a value of 1 to every row in my test DataFrame. (If this method is incorrect, or if there's a more elegant solution, please let me know.)

Upvotes: 1

D_Serg

Reputation: 494

The model in the line model = sm.OLS(y_train,X_train[:,[0,1,2,3,4,6]]) is fitted on 6 columns, because the column at index 5 of X_train (the sixth column) is left out. The data passed to predict() must therefore have the same 6 columns. y_pred = result.predict(X_test) fails because X_test still has all 7 columns. The proper fix here is to select the same columns from X_test:

y_pred = result.predict(X_test[:, [0,1,2,3,4,6]])
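To see where the message comes from: for a linear model, predict() boils down to a dot product between the data passed in and the fitted coefficient vector. Here is a NumPy-only sketch on random data with the shapes from the question. np.linalg.lstsq stands in for the OLS fit, so this illustrates the shape logic rather than statsmodels itself:

```python
import numpy as np

# Random data with the same shapes as in the question
rng = np.random.default_rng(0)
X_train = rng.normal(size=(297, 7))
X_test = rng.normal(size=(75, 7))
y_train = rng.normal(size=297)

cols = [0, 1, 2, 3, 4, 6]  # columns used for fitting (index 5 left out)

# Least-squares fit on 6 columns gives one coefficient per column
params, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
print(params.shape)

# A 7-column matrix cannot be multiplied by a 6-element vector
mismatch_message = None
try:
    np.dot(X_test, params)
except ValueError as err:
    mismatch_message = str(err)
    print(mismatch_message)

# Selecting the same 6 columns makes the shapes line up
y_pred = np.dot(X_test[:, cols], params)
print(y_pred.shape)
```

Keeping the column list in one variable (cols here, a name chosen for this example) and using it for both fitting and predicting makes this mistake harder to repeat.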

BONUS

I see you are using the Pandas library. A better practice to drop columns is to use .drop so instead of

newdf.loc[:, newdf.columns != 'V-9'].values

you can use

newdf.drop('V-9', axis=1) # axis=1 makes sure cols are dropped, not rows

likewise instead of

X_train[:,[0,1,2,3,4,6]]

you can use

X_train.drop(X_train.columns[5], axis=1) # drops the column at position 5 (the sixth column); X_train must be a DataFrame, not a NumPy array

This makes it more readable and easier to code especially if you had 50 dimensions instead of 7.
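A small sketch of the .drop approach on a made-up DataFrame (the column names other than 'V-9' are invented for illustration). Note that .drop is a DataFrame method, so it only applies if the data is kept as a DataFrame rather than converted with .values:

```python
import numpy as np
import pandas as pd

# Made-up frame with 8 columns; only 'V-9' is taken from the question
rng = np.random.default_rng(0)
newdf = pd.DataFrame(
    rng.normal(size=(10, 8)),
    columns=[f"V-{i}" for i in range(2, 10)],
)

# Drop by name: every column except 'V-9'
X = newdf.drop("V-9", axis=1)

# Drop by position: look up the label at position 5, then drop it
X_reduced = X.drop(X.columns[5], axis=1)

# Same result as the positional NumPy indexing from the question
same = np.array_equal(X_reduced.to_numpy(), X.to_numpy()[:, [0, 1, 2, 3, 4, 6]])
print(X.shape, X_reduced.shape, same)
```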

I am glad it helps!

Upvotes: 6
