Jorayen

Reputation: 1971

DecisionTreeRegressor score not calculated

I'm trying to calculate the score of a DecisionTreeRegressor with the following code:

import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import accuracy_score
from sklearn import tree

# train_df is the Kaggle train.csv loaded into a DataFrame
# some features (e.g. HouseStyle) might be better served by LabelEncoder, but the chance that they
# affect the target LotFrontage is small, so we just one-hot encode and drop unwanted columns later
encoded_df = pd.get_dummies(train_df, prefix_sep="_", columns=['MSZoning', 'Street', 'Alley',
                                                       'LotShape', 'LandContour', 'Utilities',
                                                       'LotConfig', 'LandSlope', 'Neighborhood',
                                                       'Condition1', 'Condition2', 'BldgType', 'HouseStyle'])
encoded_df = encoded_df[['LotFrontage', 'LotArea', 'LotShape_IR1', 'LotShape_IR2', 'LotShape_IR3',
           'LotConfig_Corner', 'LotConfig_CulDSac', 'LotConfig_FR2', 'LotConfig_FR3', 'LotConfig_Inside']]

# impute LotFrontage with the mean value (the outlier ratio is low, so the mean is a reasonable fill)
encoded_df['LotFrontage'].fillna(encoded_df['LotFrontage'].mean(), inplace=True)
X = encoded_df.drop('LotFrontage', axis=1)
y = encoded_df['LotFrontage'].astype('int32')
X_train, X_test, y_train, y_test = train_test_split(X, y)
classifier = DecisionTreeRegressor()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_test = y_test.values.reshape(-1, 1)
classifier.score(y_test, y_pred)
print("Accuracy is: ", accuracy_score(y_test, y_pred) * 100)

When it gets to calculating the score of the model, I get the following error:

ValueError: Number of features of the model must match the input. Model n_features is 9 and input n_features is 1 

I'm not sure why this happens, because according to the sklearn docs the test samples should have the shape (n_samples, n_features), and y_test is indeed in that shape:

y_test.shape # (365, 1)

and the true labels should have the shape (n_samples,) or (n_samples, n_outputs), and y_pred is indeed in that shape:

y_pred.shape # (365,)
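
For reference, printing the shapes involved gives the following (a quick check; the exact row counts assume the default 75/25 train_test_split on the ~1460-row training file):

print(X_train.shape)  # (1095, 9) - the regressor was fitted on 9 feature columns
print(X_test.shape)   # (365, 9)
print(y_test.shape)   # (365, 1) after the reshape
print(y_pred.shape)   # (365,)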

The dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Upvotes: 1

Views: 108

Answers (1)

The first argument of the score function shouldn't be the target values of the test set; it should be the input features of the test set, so you should do:

classifier.score(X_test, y_test)
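
Also note that for a regressor, score returns the R² (coefficient of determination), and accuracy_score only works for classification targets, so the evaluation at the end of your script would become something like this (a minimal sketch; mean_absolute_error is just one common regression metric to report alongside R², not something from your original code):

from sklearn.metrics import mean_absolute_error

y_pred = classifier.predict(X_test)
print("R^2 is:", classifier.score(X_test, y_test))     # for a regressor, score returns the coefficient of determination
print("MAE is:", mean_absolute_error(y_test, y_pred))  # average absolute error, in the units of LotFrontage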

Upvotes: 2
