Reputation: 141
I would like to create a Random Forest model with caret. Since there are missing values in the training set, I was looking for possible solutions and came across the option "na.roughfix" from the package "randomForest". If the library randomForest is loaded, this option can be used as the argument for the parameter "na.action" within the train function of caret. Inside the train function I use a 5-fold CV and tune for the best ROC value. I do this to ensure comparability with other models. The method I've chosen for the Random Forest is "ranger".
But now something strange happens: When I trigger the train function, the calculation is started, but for example the following error message appears:
model fit failed for Fold5: mtry= 7, splitrule=gini, min.node.size= 5 Error : Missing data in columns: ...
The "..." stands for the columns in which the missing values occur. Moreover, this error message always occurs, regardless of the fold or the value of mtry.
I am well aware that there are missing values in these columns ... that's why I use na.roughfix. I also remove the NZVs, but that doesn't help either.
I would be very happy about an explanation or even a solution!
Many greetings
Edit: I've now seen that if I want to choose the "na.action" argument in the train function, it does not appear automatically, which it usually does. It seems that it's somehow lost ... maybe this is the reason why caret does not use na.roughfix ...
Edit 2: I guess that this is one part of the problem. train always behaves differently, depending on the previous arguments. In my train function I use a recipe from the recipes package to remove the NZVs. As soon as I remove the recipe, the na.action argument becomes available again. However, now the preProcess argument has vanished, meaning I cannot remove the NZVs anymore. This is really a mess :-/ Is there a possibility to apply the na.action AND the preProcess argument at the same time, or any other solution for my Missing-Values-NZV problem?
Edit 3: As requested by the user missuse, I will try to provide a code example. Unfortunately I cannot provide the data since mine is relatively sensitive - thank you for your understanding.
First, I create a "blueprint" which I hand over to the train function. Here, I remove the Near Zero Variance Variables.
blueprint <- recipe(target ~ ., data = train_data) %>%
  step_nzv(all_predictors())
In the next step, I define the trainControl
train_control <- trainControl(method = "cv",
                              number = 5,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary,
                              verboseIter = TRUE)
and a grid:
hyper_grid <- expand.grid(mtry = 1:(ncol(train_data) - 1),
                          splitrule = c("gini", "extratrees"),
                          min.node.size = c(1, 3, 5, 7, 10))
Finally, I put it all together into the train function:
tuned_rf <- train(
  blueprint,
  data = train_data,
  method = "ranger",
  metric = "ROC",
  trControl = train_control,
  tuneGrid = hyper_grid,
  na.action = na.roughfix
)
Here, the argument na.action doesn't get suggested by R, meaning that it is not available. This throws the error message from the opening question. However, if I remove the blueprint and write the model like this:
tuned_rf <- train(
  target ~ .,
  data = train_data,
  method = "ranger",
  metric = "ROC",
  trControl = train_control,
  tuneGrid = hyper_grid,
  na.action = na.roughfix
)
na.action is available and na.roughfix can be used. However, the preprocessing is now missing. If I want to add the argument "preProcess =" to remove the NZVs, R does not suggest it, meaning that it is no longer available. Therefore, I would have to replace the formula and the data with the training data X and the response variable Y. Now preProcess is available again ... but na.action has vanished, so I cannot use na.roughfix.
tuned_rf <- train(
  X,
  Y,
  method = "ranger",
  metric = "ROC",
  trControl = train_control,
  tuneGrid = hyper_grid,
  preProcess = "nzv"
)
Of course I could identify the NZVs first and remove them manually - but if I want to apply further steps, the whole process gets complicated.
I hope my problem is easier to understand now ...
Upvotes: 2
Views: 921
Reputation: 19756
According to the help page ?randomForest::na.roughfix, na.roughfix just performs median/mode imputation. When using a recipe, you can replace it with step_impute_median and step_impute_mode.
Your blueprint would look like:
library(recipes)
blueprint <- recipe(target ~ ., data = train_data) %>%
  step_nzv(all_predictors()) %>%
  step_impute_median(all_numeric()) %>%
  step_impute_mode(all_nominal())
Perhaps also try
blueprint <- recipe(target ~ ., data = train_data) %>%
  step_impute_median(all_numeric()) %>%
  step_impute_mode(all_nominal()) %>%
  step_nzv(all_predictors())
Which order works better depends on how step_nzv handles missing values.
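To see what a given ordering does before handing the recipe to train, you can prep and bake it on its own and count the NAs that remain. Below is a minimal sketch on made-up data. The toy data frame and its column names are hypothetical, and I use all_numeric_predictors() / all_nominal_predictors() instead of all_numeric() / all_nominal() as my own assumption, so that the outcome is left out of the imputation steps.

```r
library(recipes)

# Hypothetical toy data standing in for train_data (not the asker's data)
set.seed(42)
n <- 200
toy_data <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  constant = 1,  # a column with a single value, a candidate for step_nzv
  target = factor(sample(c("yes", "no"), n, replace = TRUE))
)
toy_data$x1[sample(n, 20)] <- NA
toy_data$x3[sample(n, 20)] <- NA

# Impute first, then filter near-zero-variance predictors
blueprint <- recipe(target ~ ., data = toy_data) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_nzv(all_predictors())

# Estimate the steps on the training data and apply them to that same data
prepped <- prep(blueprint, training = toy_data)
baked <- bake(prepped, new_data = NULL)

colSums(is.na(baked))  # remaining NAs per column
names(baked)           # columns that survived step_nzv
```

Swapping the order of the steps in the recipe and rerunning the last lines lets you compare both variants directly.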
I would also check performance with other imputation functions such as step_impute_bag and step_impute_knn.
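With imputation done inside the recipe, the na.action argument is no longer needed in the train call. Here is a rough sketch of how that could look, again on hypothetical toy data with a deliberately small grid. The object names (toy, rec, ctrl, grid) and the predictor-only selectors are my assumptions, and it needs the ranger package installed (plus pROC for twoClassSummary).

```r
library(caret)
library(recipes)

# Hypothetical toy data with missing values in two predictors
set.seed(1)
n <- 300
toy <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  target = factor(sample(c("yes", "no"), n, replace = TRUE))
)
toy$x1[sample(n, 30)] <- NA
toy$x3[sample(n, 30)] <- NA

# Imputation and NZV filtering both live in the recipe,
# so they are re-estimated within each resampling fold
rec <- recipe(target ~ ., data = toy) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_nzv(all_predictors())

ctrl <- trainControl(method = "cv",
                     number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Small grid for illustration only; mtry must not exceed the number of predictors
grid <- expand.grid(mtry = 1:2,
                    splitrule = "gini",
                    min.node.size = c(1, 5))

# No na.action and no preProcess: the recipe handles both jobs
fit <- train(rec,
             data = toy,
             method = "ranger",
             metric = "ROC",
             trControl = ctrl,
             tuneGrid = grid)

fit$bestTune
```

To compare imputation methods, you could replace the two imputation steps with step_impute_knn(all_predictors()) or step_impute_bag(all_predictors()) and refit with the same trainControl.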
Upvotes: 3