Reputation: 6323
I get an error when I try to use randomForest in R. When I enter
basic3prox <- randomForest(activity ~.,data=train,proximity=TRUE,xtest=valid)
where train is a data frame of training data and valid is a data frame of test data, I get the following error:
Error in randomForest.default(m, y, ...) :
x and xtest must have same number of columns
But they do have the same number of columns. I used subset() to get them from the same original dataset, and when I run dim() I get
dim(train)
[1] 3237 563
dim(valid)
[1] 2630 563
So I am at a loss to figure out what is wrong here.
Upvotes: 3
Views: 4822
Reputation: 1
Maybe it is not a bug. When you ran dim(), you got different numbers, which means the training data and the validation data do have different dimensions. I have run into this problem myself. My solution was as follows. First, I used names() to list the variables in the training data and in the validation data, and saw that they did differ. Second, I used setdiff() to "subtract" the surplus variables (if the training data has more variables than the validation data, drop the surplus from the training data, and vice versa). After that, the training and validation data have the same variables, and you can use randomForest.
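For what it's worth, the names()/setdiff() check described above could look roughly like this. It is only a sketch on made-up toy data: the data frames, the column names a, b and extra, and the helper variables only_in_train, only_in_valid and shared are all hypothetical stand-ins for the real train and valid sets.

```r
# Toy data frames standing in for the real train/valid sets (hypothetical data).
train <- data.frame(activity = c("walk", "sit"), a = 1:2, b = 3:4, extra = 5:6)
valid <- data.frame(activity = c("sit", "walk"), a = 7:8, b = 9:10)

# Variables present in one set but missing from the other.
only_in_train <- setdiff(names(train), names(valid))
only_in_valid <- setdiff(names(valid), names(train))

# Keep only the variables both sets share, in the same order.
shared <- intersect(names(train), names(valid))
train_aligned <- train[, shared, drop = FALSE]
valid_aligned <- valid[, shared, drop = FALSE]
```

If both setdiff() calls come back empty, the two sets already have the same variables and the mismatch has to come from somewhere else.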
Upvotes: 0
Reputation: 18628
No they don't; train has 562 predictor columns and 1 decision column, so valid must have 562 columns (and the corresponding decision must be passed to the ytest argument).
So the invocation should look like:
randomForest(activity~.,data=train,proximity=TRUE,
xtest=valid[,names(valid)!='activity'],ytest=valid[,'activity'])
However, this is a dirty hack which will fail for more complex formulae, so it shouldn't be used (even the authors tried to prohibit it, as Joran pointed out in the comments). The correct, easier and faster way is to use separate objects for predictors and decisions instead of formulae, like this:
randomForest(trainPredictors,trainActivity,proximity=TRUE,
xtest=testPredictors,ytest=testActivity)
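In case it helps, here is one way the four objects in that call might be built. This is a sketch under assumptions: iris stands in for the real data, Species plays the role of the activity column, and the split itself is arbitrary. The randomForest call is skipped if the package is not installed.

```r
# Toy split of iris standing in for the real train/valid frames;
# Species plays the role of the activity column (assumption for illustration).
set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
valid <- iris[-idx, ]

response <- "Species"

# Predictors are every column except the response; the response is kept separately.
trainPredictors <- train[, names(train) != response, drop = FALSE]
trainActivity   <- train[[response]]
testPredictors  <- valid[, names(valid) != response, drop = FALSE]
testActivity    <- valid[[response]]

# x and xtest now have the same number of columns.
stopifnot(ncol(trainPredictors) == ncol(testPredictors))

# Fit only if the randomForest package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(trainPredictors, trainActivity,
                                    proximity = TRUE,
                                    xtest = testPredictors, ytest = testActivity)
  # Cross-tabulate test-set predictions against the true labels.
  print(table(predicted = fit$test$predicted, actual = testActivity))
}
```

The response has to be a factor for classification; if activity were stored as character, it would need as.factor() first.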
Upvotes: 4