Reputation: 667
Hi, I am using the caret package to train a model with the knn algorithm, but I am running into an error. I am using the German credit data, and this is the structure of the data frame:
'data.frame': 1000 obs. of 21 variables:
$ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1
$ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
$ credit_history : Factor w/ 5 levels "critical","delayed",..: 1 5 1 5
$ purpose : Factor w/ 10 levels "business","car (new)",..: 8 8 5
$ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059
$ savings_balance : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1
$ employment_length : Factor w/ 5 levels "> 7 yrs","0 - 1 yrs",..: 1 3 4
$ installment_rate : int 4 2 2 2 3 2 3 2 2 4 ...
$ personal_status : Factor w/ 4 levels "divorced male",..: 4 2 4 4 4
$ other_debtors : Factor w/ 3 levels "co-applicant",..: 3 3 3 2 3 3
$ residence_history : int 4 2 3 4 4 4 4 2 4 2 ...
$ property : Factor w/ 4 levels "building society savings",..:
$ age : int 67 22 49 45 53 35 53 35 61
$ installment_plan : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2
$ housing : Factor w/ 3 levels "for free","own",..: 2 2 1 2 3
$ existing_credits : int 2 1 1 1 2 1 1 1 ...
$ default : Factor w/ 2 levels "1","2": 1 2 1 1 2 1 1 2 ...
$ dependents : int 1 1 2 2 2 2 1 1 ...
$ telephone : Factor w/ 2 levels "none","yes": 2 1 1 1 2 1 1 .
$ foreign_worker : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 ...
$ job : Factor w/ 4 levels "mangement self-employed",..: 2
The target variable is credit$default.
When I run the code
cv_opts = trainControl(method="repeatedcv", repeats = 5)
model_knn<-train(trainSet[,predictors],trainSet[,outcomeName],method="knn", trControl=cv_opts)
I get this error:
Something is wrong; all the Accuracy metric values are missing:
Accuracy Kappa
Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA
Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA
NA's :3 NA's :3
Error: Stopping
In addition: There were 50 or more warnings (use warnings() to see the first 50)
The same code works with other methods (rpart, ada), so it seems I am missing something in the trControl for knn?
Upvotes: 0
Views: 1351
Reputation: 19756
The problem is that knn does not know how to handle categorical predictors when you use the default S3 method of the caret train function. Here is an example using the Servo data (from the mlbench package), whose summary is:
Motor Screw Pgain Vgain Class
A:36 A:42 3:50 1:47 Min. : 1.00
B:36 B:35 4:66 2:49 1st Qu.:10.50
C:40 C:31 5:26 3:27 Median :18.00
D:22 D:30 6:25 4:22 Mean :21.17
E:33 E:29 5:22 3rd Qu.:33.50
Max. :51.00
So all the predictors are categorical.
predictors <- colnames(Servo)[1:4]
cv_opts <- trainControl(method = "repeatedcv", repeats = 5)
# default (x, y) interface: the factor predictors are handed to knn as-is
model_knn <- train(Servo[predictors],
                   Servo$Class,
                   method = "knn",
                   trControl = cv_opts)
results in:
Something is wrong; all the RMSE metric values are missing:...
To overcome this, one can use the formula S3 method for train:
model_knn <- train(Class~.,
data = Servo,
method = "knn",
trControl = cv_opts)
k-Nearest Neighbors
167 samples
4 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 151, 149, 149, 150, 151, 151, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 9.124929 0.6404554 7.820686
7 9.356812 0.6393563 7.983302
9 9.775620 0.6169618 8.396535
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
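To see why the formula method works, here is a rough sketch of the dummy encoding it applies to factor predictors before knn sees them. The toy data frame and its column names are made up for illustration, and this only approximates what train does internally (as far as I know it builds a model matrix and drops the intercept column):

```r
# Toy data frame (hypothetical, for illustration only)
toy <- data.frame(
  y     = c(1.5, 2.0, 3.5, 4.0),
  motor = factor(c("A", "B", "C", "A")),
  gain  = factor(c("3", "4", "3", "4"))
)

# Expand the factors into treatment-coded 0/1 columns
X <- model.matrix(y ~ ., data = toy)
print(colnames(X))

# Drop the intercept so only numeric dummy columns remain,
# which is the kind of input a distance-based method can use
X_no_int <- X[, colnames(X) != "(Intercept)", drop = FALSE]
print(X_no_int)
```

Each factor with m levels becomes m - 1 numeric columns, so knn can compute distances on them.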
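Or you can build your own model matrix and use it in the default S3 method:
# The formula inside model.matrix() was lost from the original post;
# Class ~ . - 1 is an assumption (it yields 16 dummy columns, matching the output below)
Servo_X <- model.matrix(Class ~ . - 1,
                        data = Servo)
model_knn2 <- train(Servo_X,
                    Servo$Class,
                    method = "knn",
                    trControl = cv_opts)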
k-Nearest Neighbors
167 samples
16 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 149, 151, 151, 150, 151, 151, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 9.289972 0.6310129 7.869684
7 9.487649 0.6401052 8.021603
9 9.908227 0.6479472 8.604000
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
Additionally, it's a good idea to use preProc = c("center", "scale") when using knn, since knn is distance-based and you want all the predictors to be on the same scale.
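Putting the two suggestions together for a credit-style problem might look like the sketch below. The credit_toy data frame is a made-up stand-in for the asker's data (random values, simplified level names), and I spell the argument out as preProcess, which is the full name that preProc abbreviates:

```r
library(caret)

set.seed(42)
n <- 200

# Hypothetical stand-in for the credit data: one factor, two numeric columns
credit_toy <- data.frame(
  checking_balance = factor(sample(c("low", "mid", "high", "unknown"),
                                   n, replace = TRUE)),
  amount           = round(runif(n, 250, 10000)),
  age              = sample(19:75, n, replace = TRUE),
  default          = factor(sample(c("1", "2"), n, replace = TRUE))
)

cv_opts <- trainControl(method = "repeatedcv", repeats = 5)

# Formula interface: the factor is dummy-encoded before knn sees it.
# preProcess centers and scales the predictors within each resample.
model_knn_scaled <- train(default ~ .,
                          data = credit_toy,
                          method = "knn",
                          preProcess = c("center", "scale"),
                          trControl = cv_opts)

print(model_knn_scaled$results[, c("k", "Accuracy", "Kappa")])
```

The labels here are random, so the accuracy itself is meaningless; the point is only that the resampling metrics are no longer NA.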
To understand what is happening when you use the formula interface check out:
as well as
Upvotes: 1