Reputation: 1971
I'm trying to calculate the score of a DecisionTreeRegressor with the following code:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import accuracy_score
from sklearn import tree
# some features are better using LabelEncoder like HouseStyle but the chance that they will affect
# the target LotFrontage are small so we just use HotEncoder and drop unwanted columns later
encoded_df = pd.get_dummies(train_df, prefix_sep="_", columns=['MSZoning', 'Street', 'Alley',
'LotShape', 'LandContour', 'Utilities',
'LotConfig', 'LandSlope', 'Neighborhood',
'Condition1', 'Condition2', 'BldgType', 'HouseStyle'])
encoded_df = encoded_df[['LotFrontage', 'LotArea', 'LotShape_IR1', 'LotShape_IR2', 'LotShape_IR3',
'LotConfig_Corner', 'LotConfig_CulDSac', 'LotConfig_FR2', 'LotConfig_FR3', 'LotConfig_Inside']]
# imputate LotFrontage with the mean value (we saw low outliers ratio so we gonna use the mean value)
encoded_df['LotFrontage'].fillna(encoded_df['LotFrontage'].mean(), inplace=True)
X = encoded_df.drop('LotFrontage', axis=1)
y = encoded_df['LotFrontage'].astype('int32')
X_train, X_test, y_train, y_test = train_test_split(X, y)
classifier = DecisionTreeRegressor()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_test = y_test.values.reshape(-1, 1)
classifier.score(y_test, y_pred)
print("Accuracy is: ", accuracy_score(y_test, y_pred) * 100)
when it's gets to calculating the score of the model I get the following error:
ValueError: Number of features of the model must match the input. Model n_features is 9 and input n_features is 1
Not sure as to why it happens because according sklearn docs the Test Samples are to be in the shape of (n_samples, n_features)
and y_test
is indeed in this shape:
y_test.shape # (365, 1)
and the True labels should be in the shape of (n_samples) or (n_samples, n_outputs)
and y_pred
is indeed in this shape:
y_pred.shape # (365,)
The dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
Upvotes: 1
Views: 108
Reputation: 599
The first argument of the score function shouldn't be the target value of the test set, it should be the input value of the test set, so you should do
classifier.score(X_test, y_test)
Upvotes: 2