Reputation: 19
I'm using R and the breastCancer
data frame. I want to use the function train
from the package caret
, but it doesn't work because of the error below. However, when I use another data frame, the function works.
library(mlbench)
library(caret)
data("breastCancer")
BC = na.omit(breastCancer[,-1])
a = train(Class~., data = as.matrix(BC), method = "svmRadial")
This is the error:
error : In .local(x, ...) : Variable(s) `' constant. Cannot scale data.
Upvotes: 0
Views: 2071
Reputation: 46968
We can start with the data you have:
library(mlbench)
library(caret)
data(BreastCancer)
BC = na.omit(BreastCancer[,-1])
str(BC)
'data.frame': 683 obs. of 10 variables:
$ Cl.thickness : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
$ Cell.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
$ Cell.shape : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
$ Marg.adhesion : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
$ Epith.c.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
$ Bare.nuclei : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
$ Bl.cromatin : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
$ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
$ Mitoses : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
$ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
BC
is a data.frame, and you can see that all your predictors are categorical or ordinal. You are trying to fit svmRadial, i.e. an SVM with a radial basis function kernel. Calculating Euclidean distance between categorical features is not trivial, and if you look at the distribution of your categories:
sapply(BC,table)
$Cl.thickness
1 2 3 4 5 6 7 8 9 10
139 50 104 79 128 33 23 44 14 69
$Cell.size
1 2 3 4 5 6 7 8 9 10
373 45 52 38 30 25 19 28 6 67
$Cell.shape
1 2 3 4 5 6 7 8 9 10
346 58 53 43 32 29 30 27 7 58
$Marg.adhesion
1 2 3 4 5 6 7 8 9 10
393 58 58 33 23 21 13 25 4 55
When you train the model, resampling defaults to the bootstrap, so some bootstrap samples will be missing the levels that are sparsely represented, for example category 9 of Marg.adhesion
in the table above. That variable's dummy column then becomes all zero in the resample, hence the error. It most likely doesn't affect the overall result much, since those levels are rare.
One solution is to use cross-validation (it is unlikely that all the rare observations end up in the held-out fold). Also note that you should never convert to a matrix using as.matrix()
when you have a data.frame with factors or characters. caret can handle a data.frame like this directly:
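To see the failure mode concretely, here is a small simulation (hypothetical data, not the actual BC frame) estimating how often a bootstrap resample of 683 rows loses a level carried by only 4 of them, as with category 9 of Marg.adhesion:

```r
# Sketch with hypothetical data: how a rare factor level can vanish from a
# bootstrap resample. 4 of 683 rows carry level "9".
set.seed(42)
x <- factor(rep(c("1", "9"), times = c(679, 4)))

# Fraction of 1000 bootstrap resamples that contain no "9" at all
missing9 <- replicate(1000, {
  idx <- sample(length(x), replace = TRUE)
  !"9" %in% x[idx]
})
mean(missing9)  # roughly exp(-4), i.e. around 2% of resamples
# In such a resample the dummy column for level "9" is constant (all zero),
# which is what triggers the "Cannot scale data" error during scaling.
```

With 25 bootstrap reps (the caret default), the chance that at least one rep hits this problem is therefore non-trivial even though each individual rep is usually fine.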
train(Class ~.,data=BC,method="svmRadial",trControl=trainControl(method="cv"))
Support Vector Machines with Radial Basis Function Kernel
683 samples
9 predictor
2 classes: 'benign', 'malignant'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 614, 615, 615, 615, 616, 615, ...
Resampling results across tuning parameters:
C Accuracy Kappa
0.25 0.9575654 0.9101995
0.50 0.9619346 0.9190284
1.00 0.9633838 0.9220161
Tuning parameter 'sigma' was held constant at a value of 0.01841092
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.01841092 and C = 1.
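As an aside, the warning above about as.matrix()
can be verified directly: on a data.frame containing factors it coerces every column to character.

```r
# Sketch: as.matrix() on a data.frame with factors coerces the whole
# matrix to character, losing the factor structure caret needs.
df <- data.frame(x = factor(c("1", "2")),
                 y = factor(c("benign", "malignant")))
m <- as.matrix(df)
typeof(m)  # "character": every cell is now a string
```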
The other option, if you want to keep bootstrap resampling, is to either omit the observations with these rare levels or combine them with neighboring levels.
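For the combining option, a minimal base-R sketch (using a hypothetical factor with Marg.adhesion-like counts; on the real data you would apply the same to BC$Marg.adhesion):

```r
# Hypothetical factor mimicking Marg.adhesion's sparse tail:
# 393 rows of "1", 25 of "8", only 4 of "9".
f <- factor(rep(c("1", "8", "9"), times = c(393, 25, 4)),
            levels = as.character(1:10))

# Fold the rare level "9" into its neighbor "8": assigning duplicate
# level labels merges the corresponding levels.
lev <- levels(f)
lev[lev == "9"] <- "8"
levels(f) <- lev

table(f)["8"]  # the combined level now holds 25 + 4 = 29 observations
```

After merging, no bootstrap resample can produce an all-zero dummy column for the former level "9", because it no longer exists as a separate level.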
Upvotes: 1
Reputation: 8198
Your code contains some typos: the package name is caret
, not caren
, and the dataset name is BreastCancer
, not breastCancer
. You can use the following code to get rid of the errors:
library(mlbench)
library(caret)
data(BreastCancer)
BC = na.omit(BreastCancer[,-1])
a = train(Class~., data = as.matrix(BC), method = "svmRadial")
It returns:
#> Support Vector Machines with Radial Basis Function Kernel
#>
#> 683 samples
#> 9 predictor
#> 2 classes: 'benign', 'malignant'
#>
#> No pre-processing
#> Resampling: Bootstrapped (25 reps)
#> Summary of sample sizes: 683, 683, 683, 683, 683, 683, ...
#> Resampling results across tuning parameters:
#>
#> C Accuracy Kappa
#> 0.25 0.9550137 0.9034390
#> 0.50 0.9585504 0.9107666
#> 1.00 0.9611485 0.9161541
#>
#> Tuning parameter 'sigma' was held constant at a value of 0.02349173
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were sigma = 0.02349173 and C = 1.
Upvotes: 0