Sean

Reputation: 117

KNN using Caret package giving bad results compared to other methods

I'm comparing several machine learning algorithms for automated essay scoring. For the training sets, the RMSE and Rsquared values I get are about 0.75 and 0.43 on average, respectively. But when I run KNN through the same function framework, I get RMSE = 0.95 and Rsquared = 0.09. There are no error messages, so I can't tell what's going wrong.

The dependent variable is continuous, so I'm performing regression.

Here is a snippet of my code:

library(caret)

train_control <- trainControl(method="repeatedcv", number=10, repeats=3)

# Linear Regression ============================================================
lm <- train(holistic_score~., 
            data=training, 
            trControl=train_control, 
            method="lm")
lm$results
lm_pred <- predict(lm, testing)
postResample(pred = lm_pred, obs = testing$holistic_score)
# Train:  rmse = 0.714515   rsquared = 0.4737114
# Test:   rmse = 0.7508373  rsquared = 0.4423288

# K-NN =========================================================================
knn <- train(holistic_score~.,
             data=training,
             trControl=train_control,
             tuneLength=100,
             method="knn")
knn$results
knn_pred <- predict(knn, testing)
postResample(pred=knn_pred, obs=testing$holistic_score)
# Train:  rmse = 0.9466202  rsquared = 0.07567549
# Test:   rmse = 0.9512989  rsquared = 0.0966448

I'm only showing linear regression here, but I'm running 10 different algorithms on 6 different data sets, and across the board KNN does much worse than the rest.

I've looked through the documentation and searched here, but I haven't found anything that solves or even mentions this problem. The closest I've found is a question about a similar problem, but it doesn't apply to me because I'm not using categorical predictors.

Does anyone know what could cause this?

Edit: Here is a histogram of the dependent variable (holistic_score):

[histogram of holistic_score]

Upvotes: 1

Views: 1258

Answers (1)

StupidWolf

Reputation: 46908

My guess is that you did not scale your independent variables for KNN; this is crucial when the predictors are on different scales. You can see an interesting discussion of this here. Below is a demonstration with the BostonHousing data: first KNN without scaling, then KNN with centering and scaling, and finally glmnet for comparison:

library(caret)
library(mlbench)
data(BostonHousing)
data <- BostonHousing

# KNN without scaling the predictors
train(medv ~ ., data = data, method = "knn",
      trControl = trainControl(method = "cv", number = 3))

Summary of sample sizes: 337, 338, 337 
Resampling results across tuning parameters:

  k  RMSE      Rsquared   MAE     
  5  6.721722  0.4748246  4.625845
  7  6.897760  0.4429380  4.720363
  9  6.807877  0.4550040  4.654680

train(medv ~.,data=data,method="knn",
trControl=trainControl(method="cv",number=3),
preProc = c("center", "scale"))

Pre-processing: centered (13), scaled (13) 
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 337, 338, 337 
Resampling results across tuning parameters:

  k  RMSE      Rsquared   MAE     
  5  4.873476  0.7354566  3.120004
  7  4.983704  0.7280253  3.125164
  9  4.972269  0.7348006  3.172021

train(medv ~.,data=data,method="glmnet",
trControl=trainControl(method="cv",number=3))

  alpha  lambda      RMSE      Rsquared   MAE     
  0.10   0.01355531  4.994509  0.7145962  3.483945
  0.10   0.13555307  4.997304  0.7145864  3.466551
  0.10   1.35553073  5.124558  0.7054928  3.504224
  0.55   0.01355531  4.995748  0.7145269  3.483881
  0.55   0.13555307  5.030863  0.7112925  3.463395
  0.55   1.35553073  5.423348  0.6793556  3.745830
  1.00   0.01355531  4.998020  0.7143324  3.482485
  1.00   0.13555307  5.084050  0.7055959  3.485051
  1.00   1.35553073  5.593417  0.6725029  3.904954
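
Applied to the code in your question, the fix would look roughly like this (a sketch that reuses your training, testing, train_control, and holistic_score names; knn_scaled is just a new object name):

# K-NN with centering and scaling ==============================================
knn_scaled <- train(holistic_score~.,
                    data=training,
                    trControl=train_control,
                    tuneLength=100,
                    method="knn",
                    preProc=c("center", "scale"))
knn_scaled$results
knn_scaled_pred <- predict(knn_scaled, testing)
postResample(pred=knn_scaled_pred, obs=testing$holistic_score)

If your predictors are on very different scales, this alone should bring KNN much closer to your other models.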

Upvotes: 2
