staove7

Reputation: 580

Running multiple models using a for-loop in R

I'm trying to run a loop which generates 5 random samples and then fits 5 different randomForest models.

I'm having trouble with the second part (fitting the models); I can't access the dependent variable (nam$eR in the following code):

numS <- 5 # number of samples
dataS <- ERC3
rfModels <- list()

for(j in 1:numS) {
  print(j)
  set.seed(j + 1)
  nam <- paste("RFs", j, sep = "")
  assign(nam, dataS[sample(nrow(dataS), 100000), ]) # Random sample of 100,000 rows.

  namM <- paste("RFfit", j, sep = "")
  assign(namM, randomForest(as.factor(nam$eR) ~ ., data = nam[, -231], importance = TRUE))

  rfModels[[j]] <- namM
}

Thank you in advance!

Upvotes: 0

Views: 2734

Answers (2)

Nick Criswell

Reputation: 1743

I am not sure if this will work exactly for your case since I don't have sample data, but here is how I would approach what I think you are looking for using the mtcars data set. First, it is probably best to build a list of data frames to hold the samples you will run the models on. This can be done as follows:

library(dplyr)
library(randomForest)

dfs <- list() #home for the list of dataframes on which to run a randomforest

set.seed(1)
for(i in 1:5){
  dfs[[i]] <- sample_n(mtcars, size = 10, replace = FALSE)
}

(Per the comments, a slicker way to do this would be to go with

  dfs_slicker_approach <- lapply(seq(5), 
                                 function(i) sample_n(mtcars, size = 10, replace = FALSE))

)

The dfs list now contains a list of data.frames which contain 10 randomly selected rows from the mtcars data set. (Obviously, you'll want to update this to fit your needs.)
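As a quick sanity check (a sketch assuming only dplyr and the built-in mtcars data, rebuilding the same list as above), you can confirm the shape of the resulting list:

```r
library(dplyr)

set.seed(1)
# same idea as above: five data frames of 10 randomly sampled mtcars rows
dfs <- lapply(seq(5), function(i) sample_n(mtcars, size = 10, replace = FALSE))

length(dfs)        # number of data frames in the list
sapply(dfs, nrow)  # rows in each sampled data frame
names(dfs[[1]])    # the original mtcars columns are preserved
```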

Then we run the randomForest function on this list using the lapply function as follows:

rfs <- lapply(dfs, function(m) randomForest(mpg ~ ., 
                                            data = m, importance = TRUE ))

Again, change the syntax to select the columns you are interested in predicting on. The rfs list now contains all of our randomForest objects, which you can again access using lapply. For instance, if we want the predicted values, we can do that as follows (subsetting to only the first set of predictions to avoid printing a lot of output):

> lapply(rfs, function(m) data.frame(value = predict(m)))[1]
[[1]]
                       value
Merc 230            22.85464
Merc 450SE          17.61810
Fiat 128            22.31571
Porsche 914-2       23.95909
Valiant             21.28786
Pontiac Firebird    15.93824
Ford Pantera L      21.20373
Chrysler Imperial   14.40740
Lincoln Continental 16.43074
Mazda RX4 Wag       21.18467

Upvotes: 2

Jake Kaupp

Reputation: 8072

While not deviating far from Nick's solution, here is an approach using the tidyverse workflow. Highlights are: readable code via pipes, use of dplyr verbs and purrr functionals, and keeping the data, models and predictions together in a nice tidy tibble.

library(randomForest)
library(tidyverse)

set.seed(42)

analysis <- rerun(5, sample_n(mtcars, size = 10, replace = FALSE)) %>% 
  tibble(data = .) %>% 
  rownames_to_column("model_number") %>% 
  mutate(models = map(data, ~randomForest(mpg ~ ., data = .x, importance = TRUE))) %>% 
  mutate(predict = map(models, ~predict(.x)))

You can then get what you want when you need it....

comparison <- analysis %>% 
  mutate(actual = map(data, "mpg")) %>% 
  unnest(predict, actual)

comparison

# A tibble: 50 × 3
   model_number  predict actual
          <chr>    <dbl>  <dbl>
1             1 14.10348   14.7
2             1 16.78987   15.0
3             1 15.14636   17.3
4             1 15.81265   15.5
5             1 24.11492   21.5
6             1 24.24701   22.8
7             1 15.84953   10.4
8             1 21.72781   32.4
9             1 21.78105   21.0
10            1 15.58614   16.4
# ... with 40 more rows

... and see the results easily.

ggplot(comparison, aes(actual, predict)) +
  geom_point() +
  facet_wrap(~model_number, nrow = 1)

[Plot: predicted vs. actual values, faceted by model number]
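If you also want a one-number summary per model, the same grouped-tibble workflow applies. A minimal sketch (the comparison tibble here is hypothetical illustrative data standing in for the real one above):

```r
library(tidyverse)

# hypothetical stand-in for the predicted/actual pairs above
comparison <- tibble(
  model_number = rep(c("1", "2"), each = 3),
  predict = c(20.1, 18.4, 22.0, 15.2, 25.3, 19.9),
  actual  = c(21.0, 17.5, 23.1, 14.7, 26.0, 20.4)
)

# root-mean-square error of each model's predictions
comparison %>% 
  group_by(model_number) %>% 
  summarise(rmse = sqrt(mean((predict - actual)^2)))
```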

Upvotes: 2
