pApaAPPApapapa

Reputation: 385

Random forest error: Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) : undefined columns selected

I am trying to build a time-series model using a random forest. However, I get the same error every time I run the code:

Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) : undefined columns selected

I know most of the theory behind random forests pretty well, but I haven't run much code that uses them.

Here is my code:

library(randomForest)
library(caret)

fitControl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 1,
  classProbs = FALSE,
  verboseIter = TRUE,
  preProcOptions=list(thresh=0.95,na.remove=TRUE,verbose=TRUE))

set.seed(1234)

rf_grid <- expand.grid(mtry = c(1:6))

fit <- train(df.ts[,1]~.,
         data=df.ts[,2:6],
         method="rf",
         preProcess=c("center","scale"),
         tuneGrid = rf_grid,
         trControl=fitControl,
         ntree = 200,
         metric="RMSE")

For a reproducible example, you can run the code on the following dataset:

df.ts <- structure(list(ts.t = c(315246, 219908, 193014, 231970, 248246,
  247112, 268218, 263637, 264306, 245730, 256548, 227525, 304468,
  229614, 202985), ts1 = c(233913, 315246, 219908, 193014, 231970,
  248246, 247112, 268218, 263637, 264306, 245730, 256548, 227525,
  304468, 229614), ts2 = c(253534, 233913, 315246, 219908, 193014,
  231970, 248246, 247112, 268218, 263637, 264306, 245730, 256548,
  227525, 304468), ts3 = c(226650, 253534, 233913, 315246, 219908,
  193014, 231970, 248246, 247112, 268218, 263637, 264306, 245730,
  256548, 227525), ts6 = c(213268, 242558, 250554, 226650, 253534,
  233913, 315246, 219908, 193014, 231970, 248246, 247112, 268218,
  263637, 264306), ts12 = c(333842, 210279, 193051, 174262, 216712,
  144327, 213268, 242558, 250554, 226650, 253534, 233913, 315246,
  219908, 193014)), .Names = c("ts.t", "ts1", "ts2", "ts3", "ts6", "ts12"), row.names = 13:27, class = "data.frame")

I hope someone can spot my error(s).

Thanks,

Upvotes: 4

Views: 21466

Answers (4)

Priyansh

Reputation: 1248

For me, using classProbs = TRUE instead of classProbs = FALSE also worked.

Upvotes: 1

Meera Bankar

Reputation: 1

Just use caret::train(var ~ ., data = Data) instead of train(Data$var ~ ., data = Data), and that should work.

Upvotes: 0

lalit panwar

Reputation: 96

library(randomForest)
library(caret)

fitControl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 1,
  classProbs = FALSE,
  verboseIter = TRUE,
  preProcOptions=list(thresh=0.95,na.remove=TRUE,verbose=TRUE))

set.seed(1234)

rf_grid <- expand.grid(mtry = c(1:6))

fit <- train(ts.t~.,
         data=df.ts[,1:6],
         method="rf",
         preProcess=c("center","scale"),
         tuneGrid = rf_grid,
         trControl=fitControl,
         ntree = 200,
         metric="RMSE")

Note that the dependent variable must be part of the data set passed to train. Instead of writing df.ts[, 1] in the formula, refer to the column by its name, ts.t; the dot then stands for all the other columns of the data set, i.e. columns 2 to 6. This will resolve your error. Cheers!
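To see why the original call fails, it can help to look at which variable names the formula asks for. The error message shows the formula's variables being looked up as columns of data via all.vars(Terms), so any name in the formula that is not a column of data triggers "undefined columns selected". The sketch below uses base R only and a small made-up data frame called toy as a stand-in for df.ts (toy, predictors and the other variable names are illustrative assumptions, not part of the question); it mimics that lookup rather than reproducing caret's internals exactly.

```r
# Toy data frame standing in for df.ts (values are made up for illustration).
toy <- data.frame(ts.t = c(5, 7, 6, 8),
                  ts1  = c(4, 5, 7, 6),
                  ts2  = c(3, 4, 5, 7))
predictors <- toy[, 2:3]  # what the question passes as `data`

# Question's pattern: the response is written as toy[, 1], so the formula
# refers to the variable `toy`, which is not a column of `predictors`.
bad_terms <- terms(toy[, 1] ~ ., data = predictors)
missing_in_data <- setdiff(all.vars(bad_terms), names(predictors))
print(missing_in_data)

# Corrected pattern: the response is named by its column, and `data`
# contains that column, so every variable can be found.
good_terms <- terms(ts.t ~ ., data = toy)
good_missing <- setdiff(all.vars(good_terms), names(toy))
print(good_missing)
```

In the first case the formula needs a column named after the data frame itself, which does not exist in data; in the second case nothing is missing.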

Upvotes: 0

thie1e

Reputation: 3688

The formula should correspond to the names of the variables in data. E.g. y ~ . predicts y using all other variables in data. Alternatively you could use y = df.ts[,1], x = df.ts[, -1] instead of formula and data.

Thus the correct syntax would be:

fit <- train(ts.t ~ .,
             data=df.ts,
             method="rf",
             preProcess=c("center","scale"),
             tuneGrid = rf_grid,
             trControl=fitControl,
             ntree = 200,
             metric="RMSE") 
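The x/y alternative mentioned above could look roughly like the following. This is only a sketch under assumptions: toy is a small simulated data frame standing in for df.ts, and the tuning grid, fold count and ntree value are arbitrary illustrative choices. The train call runs only if caret and randomForest are installed.

```r
# Simulated stand-in for df.ts: a response plus two lagged copies of it.
set.seed(1234)
n <- 30
toy <- data.frame(ts.t = rnorm(n, mean = 250000, sd = 30000))
toy$ts1 <- c(NA, head(toy$ts.t, -1))      # lag 1
toy$ts2 <- c(NA, NA, head(toy$ts.t, -2))  # lag 2
toy <- na.omit(toy)                       # drop rows without full lags

x <- toy[, -1, drop = FALSE]  # predictors only (data frame)
y <- toy[, 1]                 # response as a plain numeric vector

# Fit only when the packages are available, so the sketch never errors
# on a machine without them.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  fit <- caret::train(x = x, y = y,
                      method = "rf",
                      tuneGrid = expand.grid(mtry = 1:2),
                      trControl = caret::trainControl(method = "cv", number = 5),
                      ntree = 50)
  print(fit$results)
}
```

With this interface no formula is involved, so there are no variable names to look up in data; the response and the predictors are simply passed as separate objects.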

Upvotes: 4
