Reputation: 385
I am trying to build a time-series model using a random forest. However, I get the same error every time I run the code:
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :
  undefined columns selected
I know most of the theory behind random forests pretty well, but I haven't really run much code using them.
Here is my code:
library(randomForest)
library(caret)
fitControl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 1,
classProbs = FALSE,
verboseIter = TRUE,
preProcOptions=list(thresh=0.95,na.remove=TRUE,verbose=TRUE))
set.seed(1234)
rf_grid <- expand.grid(mtry = c(1:6))
fit <- train(df.ts[,1]~.,
data=df.ts[,2:6],
method="rf",
preProcess=c("center","scale"),
tuneGrid = rf_grid,
trControl=fitControl,
ntree = 200,
metric="RMSE")
For a reproducible example, you can run the code on the following dataset:
df.ts <- structure(list(ts.t = c(315246, 219908, 193014, 231970, 248246,
247112, 268218, 263637, 264306, 245730, 256548, 227525, 304468,
229614, 202985), ts1 = c(233913, 315246, 219908, 193014, 231970,
248246, 247112, 268218, 263637, 264306, 245730, 256548, 227525,
304468, 229614), ts2 = c(253534, 233913, 315246, 219908, 193014,
231970, 248246, 247112, 268218, 263637, 264306, 245730, 256548,
227525, 304468), ts3 = c(226650, 253534, 233913, 315246, 219908,
193014, 231970, 248246, 247112, 268218, 263637, 264306, 245730,
256548, 227525), ts6 = c(213268, 242558, 250554, 226650, 253534,
233913, 315246, 219908, 193014, 231970, 248246, 247112, 268218,
263637, 264306), ts12 = c(333842, 210279, 193051, 174262, 216712,
144327, 213268, 242558, 250554, 226650, 253534, 233913, 315246,
219908, 193014)), .Names = c("ts.t", "ts1", "ts2", "ts3", "ts6", "ts12"), row.names = 13:27, class = "data.frame")
I hope someone can spot my error(s)
Thanks,
Upvotes: 4
Views: 21466
Reputation: 1248
For me, using classProbs = TRUE instead of classProbs = FALSE also worked.
Upvotes: 1
Reputation: 1
Just use caret::train(var~., Data) instead of train(Data$var~., data=Data) and that should work.
Upvotes: 0
Reputation: 96
library(randomForest)
library(caret)
fitControl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 1,
classProbs = FALSE,
verboseIter = TRUE,
preProcOptions=list(thresh=0.95,na.remove=TRUE,verbose=TRUE))
set.seed(1234)
rf_grid <- expand.grid(mtry = c(1:6))
fit <- train(ts.t~.,
data=df.ts[,1:6],
method="rf",
preProcess=c("center","scale"),
tuneGrid = rf_grid,
trControl=fitControl,
ntree = 200,
metric="RMSE")
Note that the dependent variable must be a column of the data set passed to train. Instead of writing df.ts[, 1] in the formula, refer to the column by its name, ts.t, and let the dot stand for the remaining columns of the data set, i.e. columns 2 to 6. This should resolve your error. Cheers!
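To see where the error message comes from, here is a minimal base-R sketch. It assumes, based only on the call shown in the error text, that the variables named in the formula get selected from data by name. The toy data frame d and all of its values are made up for illustration and are not from the question.

```r
# Toy data frame standing in for df.ts (hypothetical values).
d <- data.frame(y = c(1, 2, 3, 4), a = c(2, 4, 6, 8), b = c(1, 0, 1, 0))

# Formula written the way the question does it: the response is indexed out
# of the data frame, and only the predictors are supplied through `data`.
bad_terms <- terms(d[, 1] ~ ., data = d[, 2:3])
bad_vars <- all.vars(bad_terms)  # includes "d", which is not a column name

# Selecting those names from the predictor-only data frame fails.
bad_msg <- tryCatch(
  {
    d[, 2:3][, bad_vars, drop = FALSE]
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(bad_msg)

# Formula written with the column name, with the response included in `data`.
good_terms <- terms(y ~ ., data = d)
good_vars <- all.vars(good_terms)
good_cols <- d[, good_vars, drop = FALSE]  # every name is a real column
```

The first formula asks for a variable called d, which is the name of the data frame itself and not one of its columns, so the by-name lookup cannot succeed.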
Upvotes: 0
Reputation: 3688
The formula should correspond to the names of the variables in data. E.g. y ~ . predicts y using all other variables in data. Alternatively, you could use y = df.ts[,1], x = df.ts[, -1] instead of formula and data.
Thus the correct syntax would be:
fit <- train(ts.t ~ .,
data=df.ts,
method="rf",
preProcess=c("center","scale"),
tuneGrid = rf_grid,
trControl=fitControl,
ntree = 200,
metric="RMSE")
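The x/y alternative mentioned above could look roughly like the sketch below. To keep it runnable on its own, it builds a small lagged data frame from the built-in AirPassengers series, which is an assumption for illustration only. The column names mirror the question's, and the resampling settings and grid are arbitrary choices, not recommendations.

```r
library(caret)
library(randomForest)

# Hypothetical stand-in for df.ts: the first column is the current value,
# the remaining columns are lags 1 to 3 of the same series.
lagged <- as.data.frame(embed(as.numeric(AirPassengers), 4))
names(lagged) <- c("ts.t", "ts1", "ts2", "ts3")

set.seed(1234)
ctrl <- trainControl(method = "cv", number = 5)

# mtry cannot exceed the number of predictors (3 here).
grid <- expand.grid(mtry = 1:3)

# x/y interface: no formula and no `data` argument, so no column names
# have to be resolved from a formula.
fit <- train(x = lagged[, -1],
             y = lagged[, 1],
             method = "rf",
             tuneGrid = grid,
             trControl = ctrl,
             ntree = 50)

print(fit$results[, c("mtry", "RMSE")])
```

With the question's data, the equivalent call would pass x = df.ts[, -1] and y = df.ts[, 1], with mtry limited to the five predictor columns.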
Upvotes: 4