chrisfs

Reputation: 6323

Error "x and xtest must have the same number of columns" when using randomForest

I am getting an error when I am trying to use randomForest in R. When I enter

basic3prox  <- randomForest(activity ~.,data=train,proximity=TRUE,xtest=valid)

where train is a data frame of training data and valid is a data frame of test data, I get the following error:

Error in randomForest.default(m, y, ...) : 
  x and xtest must have same number of columns

But they do have the same number of columns. I used subset() to get them from the same original dataset, and when I run dim() I get

dim(train)

[1] 3237 563

dim(valid)

[1] 2630 563

So I am at a loss to figure out what is wrong here.

Upvotes: 3

Views: 4822

Answers (2)

iris

Reputation: 1

Maybe it is not a bug. dim() gave you different numbers, which means that the training data and the validation data do have different dimensions. I have run into this problem myself. My solution was as follows. First, I used names() to list the variables in the training data and in the validation data, and saw that they did contain different variables. Second, I used setdiff() to find and drop the surplus variables (if the training data has more variables than the validation data, drop the surplus from the training data, and vice versa). After that, the training data and the validation data have the same variables, and you can use randomForest.
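The names()/setdiff() check described above could be sketched like this. It is only an illustration: the two small data frames and their column names (f1, f2, extra) are made-up stand-ins for the real train and valid, and only base R is used.

```r
# Toy stand-ins for the real train/valid data frames (hypothetical columns).
train <- data.frame(activity = factor(c("walk", "sit", "walk")),
                    f1 = c(0.1, 0.2, 0.3),
                    f2 = c(1, 2, 3),
                    extra = c(9, 8, 7))
valid <- data.frame(activity = factor(c("sit", "walk")),
                    f1 = c(0.4, 0.5),
                    f2 = c(4, 5))

# Columns present in one data frame but not in the other.
only_in_train <- setdiff(names(train), names(valid))
only_in_valid <- setdiff(names(valid), names(train))

# Keep only the shared columns, in the same order for both data frames.
shared <- intersect(names(train), names(valid))
train_aligned <- train[, shared, drop = FALSE]
valid_aligned <- valid[, shared, drop = FALSE]

print(only_in_train)
print(only_in_valid)
```

If both setdiff() calls come back empty, the two data frames already share the same variables and the cause of the error lies elsewhere.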

Upvotes: 0

mbq

Reputation: 18628

No, they don't: train has 562 predictor columns and 1 decision column, so the data passed as xtest must have 562 columns (and the corresponding decision must be passed to the ytest argument).
So the invocation should look like:

randomForest(activity~.,data=train,proximity=TRUE,
  xtest=valid[,names(valid)!='activity'],ytest=valid[,'activity'])

However, this is a dirty hack which will fail for more complex formulae and thus it shouldn't be used (even the authors tried to prohibit it, as Joran pointed out in comments). The correct, easier and faster way is to use separate objects for predictors and decisions instead of formulae, like this:

randomForest(trainPredictors,trainActivity,proximity=TRUE,
  xtest=testPredictors,ytest=testActivity)
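For illustration, here is one way the four objects in that call might be built. This is a sketch under assumptions, not the asker's actual data: the built-in iris data set stands in for the real data, Species plays the role of activity, and the 100/50 split and the seed are arbitrary. The object names follow the call above.

```r
library(randomForest)

# Stand-in data: a random split of the built-in iris data set.
set.seed(1)
idx <- sample(nrow(iris), 100)
train <- iris[idx, ]
valid <- iris[-idx, ]

# Separate the predictors from the decision column in both sets.
response <- "Species"  # plays the role of 'activity' in the question
trainPredictors <- train[, names(train) != response, drop = FALSE]
trainActivity   <- train[[response]]
testPredictors  <- valid[, names(valid) != response, drop = FALSE]
testActivity    <- valid[[response]]

# x and xtest must have the same number of columns.
stopifnot(ncol(trainPredictors) == ncol(testPredictors))

fit <- randomForest(trainPredictors, trainActivity, proximity = TRUE,
                    xtest = testPredictors, ytest = testActivity)

# Confusion matrix for the test set.
print(fit$test$confusion)
```

Because the decision column is removed from both sets before the call, x and xtest end up with matching column counts, which is the condition the error message complains about.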

Upvotes: 4
