Reputation: 117
I'm comparing a few different machine learning algorithms for automated essay scoring accuracy. The RMSE and R-squared values I'm getting on training sets are about 0.75 and 0.43 on average, respectively. But for some reason, when I run KNN using the same function framework, I get RMSE = 0.95 and R-squared = 0.09. I'm not getting any error messages either, so I don't know what's going wrong.
My data set is continuous and I'm performing regression on it.
Here is a snippet of my code:
library(caret)
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
# Linear Regression ============================================================
lm <- train(holistic_score ~ .,
            data = training,
            trControl = train_control,
            method = "lm")
lm$results
lm_pred <- predict(lm, testing)
postResample(pred = lm_pred, obs = testing$holistic_score)
# Train: rmse = 0.714515 rsquared = 0.4737114
# Test: rmse = 0.7508373 rsquared = 0.4423288
# K-NN =========================================================================
knn <- train(holistic_score ~ .,
             data = training,
             trControl = train_control,
             tuneLength = 100,
             method = "knn")
knn$results
knn_pred <- predict(knn, testing)
postResample(pred=knn_pred, obs=testing$holistic_score)
# Train: rmse = 0.9466202 rsquared = 0.07567549
# Test: rmse = 0.9512989 rsquared = 0.0966448
I'm only showing linear regression but I'm using 10 different algorithms on 6 different data sets and across the board KNN is doing much worse compared to the rest.
I've looked through the documentation and searched online and here, but I haven't found anything that solves my problem or even mentions it. This is the closest I've found to someone with a similar problem, but it doesn't apply to me because I'm not using categorical predictors.
Does anyone know what could cause this?
Edit: Here is a histogram of the dependent variable (holistic_score):
Upvotes: 1
Views: 1258
Reputation: 46908
My guess is that you did not scale your independent variables for KNN. KNN is distance-based, so when predictors are on different scales, the variables with the largest ranges dominate the distance calculation and the rest are effectively ignored; centering and scaling is crucial here. You can see an interesting discussion of this elsewhere. A demonstration with BostonHousing:
library(caret)
library(mlbench)
data(BostonHousing)
data = BostonHousing
train(medv ~ ., data = data, method = "knn",
      trControl = trainControl(method = "cv", number = 3))
Summary of sample sizes: 337, 338, 337
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 6.721722 0.4748246 4.625845
7 6.897760 0.4429380 4.720363
9 6.807877 0.4550040 4.654680
train(medv ~ ., data = data, method = "knn",
      trControl = trainControl(method = "cv", number = 3),
      preProc = c("center", "scale"))
Pre-processing: centered (13), scaled (13)
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 337, 338, 337
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 4.873476 0.7354566 3.120004
7 4.983704 0.7280253 3.125164
9 4.972269 0.7348006 3.172021
For comparison, a regularized linear model (glmnet) on the same unscaled data already performs about as well as the scaled KNN:
train(medv ~ ., data = data, method = "glmnet",
      trControl = trainControl(method = "cv", number = 3))
alpha lambda RMSE Rsquared MAE
0.10 0.01355531 4.994509 0.7145962 3.483945
0.10 0.13555307 4.997304 0.7145864 3.466551
0.10 1.35553073 5.124558 0.7054928 3.504224
0.55 0.01355531 4.995748 0.7145269 3.483881
0.55 0.13555307 5.030863 0.7112925 3.463395
0.55 1.35553073 5.423348 0.6793556 3.745830
1.00 0.01355531 4.998020 0.7143324 3.482485
1.00 0.13555307 5.084050 0.7055959 3.485051
1.00 1.35553073 5.593417 0.6725029 3.904954
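To see concretely why unscaled predictors break KNN, here is a minimal sketch in base R (toy data invented for illustration; the variable names are not from your data set):

```r
# Two predictors on very different scales
x <- data.frame(x1 = c(0, 1, 0.5),      # ranges roughly 0-1
                x2 = c(10, 1000, 500))  # ranges roughly 10-1000

# Euclidean distances on the raw data: x2 dominates, x1 is invisible
d_raw <- as.matrix(dist(x))
d_raw[1, 2]  # ~990, essentially just the x2 difference

# After centering and scaling, both predictors contribute comparably
d_scaled <- as.matrix(dist(scale(x)))
d_scaled[1, 2]  # ~2.83, x1 and x2 now matter equally
```

This is essentially what `preProc = c("center", "scale")` asks `train()` to do before the neighbor distances are computed.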
Upvotes: 2