Reputation: 125
I am trying to use the C4.5 decision tree algorithm with 10-fold cross-validation for web spam detection. After feature selection, my dataset has 8944 observations and 36 variables.
Here is my code:
#dividing the dataset into train and test
trainRowNumbers<-createDataPartition(final1$spam,p=0.7,list=FALSE)
#Create the training dataset
trainData<-final1[trainRowNumbers,]
#Create Test data
testData<-final1[-trainRowNumbers,]
#C4.5 using 10 fold cross validation
set.seed(1958)
train_control<-createFolds(trainData$spam,k=10)
C45Fit<-train(spam~.,method="J48",data=trainData,
tuneLength=15,
trControl=trainControl(
method="cv",indexOut = train_control ))
This is the error I am getting:
C45Fit<-train(spam~.,method="J48",data=trainData,
tuneLength=15,
trControl=trainControl(
method="cv",indexOut = train_control ))
Error in train(spam ~ ., method = "J48", data = trainData, tuneLength = 15, : unused arguments (method = "J48", data = trainData, tuneLength = 15, trControl = trainControl(method = "cv", indexOut = train_control))
I have two questions:
How do I resolve this error?
How should I set the tuneLength parameter?
Head of my Dataset:
> head(trainData)
hostid host HST_4 HST_6 HST_7 HST_8 HST_9 HST_10 HST_16
1 0 007cleaningagent.co.uk 0.03370787 1.9791304 0.1123596 0.1516854 0.2247191 0.2977528 0.07865169
2 1 0800.loan-line.co.uk 1.39539347 2.4222020 0.2284069 0.2610365 0.3531670 0.4529750 0.02879079
4 3 102belfast.boys-brigade.org.uk 0.29729730 1.1800000 0.2162162 0.3783784 0.5135135 0.5405405 0.21621622
5 4 10bristol.boys-brigade.org.uk 0.28804348 1.7745267 0.1141304 0.1847826 0.2608696 0.3750000 0.08152174
6 5 10enfield.boys-brigade.org.uk 0.00000000 0.8468468 0.0625000 0.1875000 0.1875000 0.3125000 0.06250000
8 8 13thcoventry.co.uk 0.05797101 2.1113074 0.2318841 0.3091787 0.3961353 0.5507246 0.09178744
HST_17 HST_18 HST_20 HMG_29 HMG_40 HMG_41 HMG_42 AVG_50 AVG_51 AVG_55 AVG_57
1 0.15730337 0.2247191 0.070 0.2907760 0.02702703 0.07207207 0.1351351 32431.65 7.215054 0.02289305 0.2980171
2 0.05566219 0.1094050 0.075 0.0495162 0.10641628 0.17840376 0.2410016 150592.89 2.000000 0.49661240 0.1137439
4 0.37837838 0.4054054 0.040 0.2156130 0.03971119 0.11552347 0.1480144 16129.61 2.125000 0.12297815 0.2033877
5 0.13043478 0.2119565 0.075 0.0405612 0.08152174 0.13043478 0.2119565 28759.75 2.870968 0.19622331 0.0673372
6 0.18750000 0.2500000 0.005 0.1125400 0.02528090 0.12359551 0.1432584 70966.61 2.000000 0.03948338 0.2513755
8 0.14975845 0.2512077 0.095 0.1946150 0.04382470 0.10458167 0.1633466 109388.89 11.484940 0.03547817 0.1387366
AVG_58 AVG_59 AVG_61 AVG_63 AVG_65 AVG_67 STD_77 STD_79 STD_80 STD_81
1 0.030079101 1.888686 0.04982536 0.07119317 0.1539772 0.2237475 0.02240051 0.04634758 0.0003248904 0.07644575
2 0.005874481 2.423238 0.14016213 0.17484142 0.2460647 0.3279534 0.03014901 0.05352347 0.0006170884 0.09449420
4 0.017285860 1.657795 0.08748573 0.14192639 0.2273218 0.2815660 0.03715705 0.07385004 0.0021174754 0.15725521
5 0.007008439 1.656472 0.10088409 0.17370255 0.2791502 0.3839271 0.03382564 0.07695898 0.0011314215 0.14290420
6 0.017145414 2.284363 0.09245673 0.14045514 0.2267635 0.2907555 0.02459505 0.06418522 0.0007756064 0.16533374
8 0.001818059 2.300361 0.17326186 0.25910768 0.3351511 0.4479340 0.05611160 0.07531329 0.0005475770 0.15796253
STD_83 STD_84 STD_85 STD_87 STD_94 spam
1 0.1219990 0.001009964 0.04043011 0.04198925 0.3400028 normal
2 0.1539489 0.001734261 0.15000000 0.16000000 0.3147682 normal
4 0.2027374 0.006655953 0.06437500 0.06031250 0.7100778 normal
5 0.1925378 0.002708827 0.04258065 0.05290323 0.8195509 normal
6 0.2223814 0.005491305 0.09125000 0.08062500 1.2953592 normal
8 0.2366591 0.002588343 0.21698795 0.14774096 0.2882247 normal
Output of sessionInfo()
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C LC_TIME=English_Australia.1252
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 ggthemes_3.5.0 randomForest_4.6-12 Metrics_0.1.3 RWeka_0.4-37 mlr_2.12.1
[7] ParamHelpers_1.10 rgeos_0.3-26 VIM_4.7.0 data.table_1.10.4-3 colorspace_1.3-2 mice_2.46.0
[13] RANN_2.5.1 kernlab_0.9-25 mlbench_2.1-1 caret_6.0-79 ggplot2_2.2.1 lattice_0.20-35
[19] dplyr_0.7.4
loaded via a namespace (and not attached):
[1] nlme_3.1-131 lubridate_1.7.3 bit64_0.9-7 dimRed_0.1.0 httr_1.3.1 backports_1.1.2 tools_3.4.0
[8] R6_2.2.2 rpart_4.1-11 DBI_0.8 lazyeval_0.2.1 nnet_7.3-12 withr_2.1.0 sp_1.2-7
[15] tidyselect_0.2.3 mnormt_1.5-5 parallelMap_1.3 bit_1.1-12 curl_3.0 compiler_3.4.0 checkmate_1.8.5
[22] scales_0.5.0 sfsmisc_1.1-1 DEoptimR_1.0-8 lmtest_0.9-35 psych_1.7.8 robustbase_0.92-8 stringr_1.2.0
[29] foreign_0.8-67 rio_0.5.10 pkgconfig_2.0.1 RWekajars_3.9.2-1 rlang_0.2.0 readxl_1.0.0 ddalpha_1.3.1
[36] BBmisc_1.11 bindr_0.1 zoo_1.8-0 ModelMetrics_1.1.0 car_3.0-0 magrittr_1.5 Matrix_1.2-12
[43] Rcpp_0.12.14 munsell_0.4.3 abind_1.4-5 stringi_1.1.6 carData_3.0-1 MASS_7.3-47 plyr_1.8.4
[50] recipes_0.1.1 parallel_3.4.0 forcats_0.3.0 haven_1.1.1 splines_3.4.0 pillar_1.2.1 boot_1.3-19
[57] rjson_0.2.15 reshape2_1.4.2 codetools_0.2-15 stats4_3.4.0 CVST_0.2-1 glue_1.2.0 laeken_0.4.6
[64] vcd_1.4-4 foreach_1.4.3 twitteR_1.1.9 cellranger_1.1.0 gtable_0.2.0 purrr_0.2.4 tidyr_0.7.2
[71] assertthat_0.2.0 DRR_0.0.2 gower_0.1.2 openxlsx_4.0.17 prodlim_1.6.1 broom_0.4.3 e1071_1.6-8
[78] class_7.3-14 survival_2.41-3 timeDate_3042.101 RcppRoll_0.2.2 tibble_1.4.2 rJava_0.9-9 iterators_1.0.8
[85] lava_1.5.1 ipred_0.9-6
Thanks in advance for any suggestions.
Upvotes: 1
Views: 761
Reputation: 13118
I could replicate the error message in the following way:
library(RWeka)
library(caret)
library(mlr)
# Loading required package: ParamHelpers
# Attaching package: ‘mlr’
# The following object is masked from ‘package:caret’:
# train
#dividing the dataset into train and test
trainRowNumbers <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
#Create the training dataset
trainData <- iris[trainRowNumbers, ]
#Create Test data
testData <- iris[-trainRowNumbers, ]
#C4.5 using 10 fold cross validation
set.seed(1958)
train_control <- createFolds(trainData$Species, k = 10)
C45Fit <- train(Species~., method = "J48",data = trainData,
tuneLength = 15,
trControl = trainControl(
method = "cv",indexOut = train_control ))
# Error in train(Species ~ ., method = "J48", data = trainData, tuneLength = 15, :
# unused arguments (method = "J48", data = trainData, tuneLength = 15, trControl = trainControl(method = "cv", indexOut = train_control))
Notice the message "The following object is masked from ‘package:caret’: train". If you load another package with a train function (e.g. mlr in this case) after you load caret, by default R will use the train from the most recently loaded package. (This is why I asked for sessionInfo(), to see what packages have been loaded. For the same reason, the replicable example should include the packages you loaded.) Instead of train from caret, R runs train from mlr (or some other package you loaded), which returns the error message.
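If you want to check which train a bare call resolves to in your own session, something like the following should work. It is only a sketch: it assumes caret and mlr are both installed and loads them in the same order as in the example above. find() is from utils, and environment() and environmentName() are base R.

```r
library(caret)
library(mlr)   # attached after caret, so its train() masks caret's

# Every attached package that exports an object called "train",
# in search-path order; the first entry is the one a bare train() call uses.
train_sources <- find("train")
print(train_sources)

# Name of the namespace that the currently visible train() belongs to.
print(environmentName(environment(train)))

# The namespace-qualified function is not affected by masking.
print(environmentName(environment(caret::train)))
```

conflicts(detail = TRUE) gives a similar overview for every masked name in the session, not only train.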
The solution is to either load caret last, or explicitly call the train function from caret using caret::train(...).
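For illustration, here is a minimal sketch of the namespace-qualified call on iris. It assumes caret and RWeka are installed and that Java is available, since J48 runs through Weka. Two things in it are my own simplifications and not part of the original code: trainControl builds the 10 folds itself (number = 10) instead of receiving createFolds output through indexOut, and tuneLength is reduced to 3 so the example runs quickly.

```r
library(caret)   # method = "J48" additionally needs RWeka and a working Java setup

set.seed(1958)

# Split iris into training and test sets (70/30)
trainRowNumbers <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
trainData <- iris[trainRowNumbers, ]
testData  <- iris[-trainRowNumbers, ]

# Qualifying the call with caret:: means it does not matter
# whether mlr (or another package) has masked train().
C45Fit <- caret::train(
  Species ~ .,
  data       = trainData,
  method     = "J48",
  tuneLength = 3,   # assumed small value for a quick run
  trControl  = caret::trainControl(method = "cv", number = 10)
)

# Cross-validated performance for each candidate parameter combination
print(C45Fit$results)
```

On the tuneLength question: in caret, tuneLength is the number of candidate values tried per tuning parameter when you do not supply your own tuneGrid. As far as I know, recent caret versions tune two parameters for J48 (C, the pruning confidence, and M, the minimum number of instances per leaf), so tuneLength = 15 would mean a fairly large grid. Starting with a small value and increasing it if the results are still changing seems a reasonable approach.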
Upvotes: 2