Reputation: 580
I'm trying to run a loop which generates 5 random samples and then fits 5 different RandomForest models.
I'm having trouble with the second part (running the models): I can't access the dependent variable (nam$eR
in the following code):
numS <- 5 # number of samples
dataS <- ERC3
rfModels <- list()
for(j in 1:numS) {
  print(j)
  set.seed(j + 1)
  nam <- paste("RFs", j, sep = "")
  assign(nam, dataS[sample(nrow(dataS), 100000), ]) # Random sample of 100,000 rows.
  namM <- paste("RFfit", j, sep = "")
  assign(namM, randomForest(as.factor(nam$eR) ~ ., data = nam[,-231], importance = TRUE))
  rfModels[[j]] <- namM
}
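To illustrate, here is a minimal example (with toy data standing in for my ERC3 data) of the error I'm hitting: nam holds the *name* of the data frame as a string, so nam$eR fails, although get(nam) does reach the object:

```r
# Toy data standing in for ERC3
dataS <- data.frame(eR = c(0, 1, 0), x = 1:3)

nam <- paste("RFs", 1, sep = "")  # nam is just the string "RFs1"
assign(nam, dataS)

# nam$eR     # fails: $ operator is invalid for atomic vectors
get(nam)$eR  # retrieves the data frame stored under the name "RFs1"
```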
Thank you in advance!
Upvotes: 0
Views: 2734
Reputation: 1743
I am not sure whether this will work exactly for your case since I don't have sample data, but if you were to do what I think you are looking for with the mtcars
data set, it would be something like this. First, it is best to keep the data frames you are running the models on in a list, which can be done as follows:
library(dplyr)
library(randomForest)
dfs <- list() #home for the list of dataframes on which to run a randomforest
set.seed(1)
for(i in 1:5){
  dfs[[i]] <- sample_n(mtcars, size = 10, replace = FALSE)
}
(Per the comments, a slicker way to do this would be:
dfs_slicker_approach <- lapply(seq(5),
                               function(i) sample_n(mtcars, size = 10, replace = FALSE))
)
The dfs
object is now a list of data.frames
, each containing 10 randomly selected rows from the mtcars
data set. (Obviously, you'll want to update this to fit your needs.)
Then we run the randomForest
function on this list using the lapply
function as follows:
rfs <- lapply(dfs, function(m) randomForest(mpg ~ .,
data = m, importance = TRUE ))
Again, change the syntax to select the columns you are interested in predicting on. The rfs
list now contains all of our randomForest
objects. You can again access these using lapply
. For instance, if we want the predicted values, we can do it as follows (we'll subset to only the first set of predictions to avoid printing a lot of output):
> lapply(rfs, function(m) data.frame(value = predict(m)))[1]
[[1]]
value
Merc 230 22.85464
Merc 450SE 17.61810
Fiat 128 22.31571
Porsche 914-2 23.95909
Valiant 21.28786
Pontiac Firebird 15.93824
Ford Pantera L 21.20373
Chrysler Imperial 14.40740
Lincoln Continental 16.43074
Mazda RX4 Wag 21.18467
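The same lapply pattern reaches any part of the fitted objects. As a sketch, here is how the variable-importance matrices could be collected (the list is rebuilt from scratch so the snippet is self-contained, with a small ntree purely for speed):

```r
library(dplyr)
library(randomForest)

set.seed(1)
dfs <- lapply(seq(5), function(i) sample_n(mtcars, size = 10, replace = FALSE))
rfs <- lapply(dfs, function(m) randomForest(mpg ~ ., data = m,
                                            importance = TRUE, ntree = 50))

# One importance matrix per model; rows are the 10 predictors,
# columns are the regression importance measures (%IncMSE, IncNodePurity)
imp <- lapply(rfs, importance)
dim(imp[[1]])
```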
Upvotes: 2
Reputation: 8072
While not deviating from Nick's solution, here is an approach using the tidyverse
workflow. Highlights are: readable code via pipes, using dplyr
verbs and purrr
functionals, and keeping data, models and predictions together in a nice tidy tibble.
library(randomForest)
library(tidyverse)
set.seed(42)
analysis <- rerun(5, sample_n(mtcars, size = 10, replace = FALSE)) %>%
tibble(data = .) %>%
rownames_to_column("model_number") %>%
mutate(models = map(data, ~randomForest(mpg ~ ., data = .x, importance = TRUE))) %>%
mutate(predict = map(models, ~predict(.x)))
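Individual models can be pulled back out of the list-column with purrr::pluck or plain [[ indexing. A sketch below, with toy lm fits standing in for the randomForest objects so it runs on its own:

```r
library(tibble)
library(purrr)

# Toy stand-in mirroring the list-column structure of `analysis`
analysis_toy <- tibble(
  model_number = c("1", "2"),
  models = list(lm(mpg ~ wt, data = mtcars),
                lm(mpg ~ hp, data = mtcars))
)

pluck(analysis_toy, "models", 1)  # first model object
analysis_toy$models[[2]]          # base R equivalent
```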
You can then get what you want when you need it....
comparison <- analysis %>%
mutate(actual = map(data, "mpg")) %>%
unnest(predict, actual)
comparison
# A tibble: 50 × 3
model_number predict actual
<chr> <dbl> <dbl>
1 1 14.10348 14.7
2 1 16.78987 15.0
3 1 15.14636 17.3
4 1 15.81265 15.5
5 1 24.11492 21.5
6 1 24.24701 22.8
7 1 15.84953 10.4
8 1 21.72781 32.4
9 1 21.78105 21.0
10 1 15.58614 16.4
# ... with 40 more rows
... and see the results easily.
ggplot(comparison, aes(actual, predict)) +
geom_point() +
facet_wrap(~model_number, nrow = 1)
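A per-model error summary drops out of the same tibble with group_by and summarise. A sketch using a toy comparison with the same columns (the values are made up for illustration):

```r
library(dplyr)

# Toy stand-in for the `comparison` tibble above
comparison <- tibble(
  model_number = rep(c("1", "2"), each = 3),
  predict = c(14.1, 16.8, 15.1, 24.1, 24.2, 15.8),
  actual  = c(14.7, 15.0, 17.3, 21.5, 22.8, 10.4)
)

comparison %>%
  group_by(model_number) %>%
  summarise(rmse = sqrt(mean((predict - actual)^2)))
```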
Upvotes: 2