AHartmann

Reputation: 23

for loop in R to run function on multiple files, calculate value from output, place in new file

I'm trying to use a for loop to run random forest on multiple input files in sequence, calculate the OOB error of the resulting rf object for each of those files (based on 5000 trees), and output those OOB error values into one results file. The results file is returning the exact same OOB values for each file, which is not correct (i.e., it's returning the OOB value for only one of my input files). I've tried the following:

fileNames = list.files(pattern="\\.csv")

for(fileName in fileNames){

  sample = read.csv(fileName, header=TRUE, sep=",")

  rf_rand = randomForest(
    sample[,3:45], 
    sample$Organism, 
    proximity=TRUE,
    importance=TRUE, 
    ntree=5000)

  OOB = mean(rf_rand$err.rate[,1])

  results = data.frame(fileNames,OOB)

  write.table(results,"rand_oob_reps.txt",sep = "\t")     
}

results

#1 sample1.csv 0.06764769
#2 sample2.csv 0.06764769
#3 sample3.csv 0.06764769

I have also tried unsuccessfully with:

for(i in 1:length(fileNames))

This seems like a simple issue, but so far my search for answers has come up empty. Thanks for any insights.

Upvotes: 2

Views: 375

Answers (3)

Parfait

Reputation: 107567

You are writing to the same file on each iteration, so every write overwrites the previous one. Also, when creating the dataframe, you are passing the entire vector fileNames rather than the individual fileName: results = data.frame(fileNames,OOB)

Consider the below lapply() solution that 1) creates a column for current file name as last column in df, 2) iteratively saves .txt files suffixed with original file name and 3) creates one list of many results dataframes compiled from all iterations:

fileNames = list.files(pattern="\\.csv")

OOBresults <- lapply(fileNames, function(file) {
       # READ IN FILE 
       sample <- read.csv(file, header=TRUE, sep=",")

       # CALCULATE RF RESULTS
       rf_rand <- randomForest(sample[,3:45], sample$Organism, 
                               proximity=TRUE, importance=TRUE, ntree=5000)        
       OOB <- mean(rf_rand$err.rate[,1])

       # CREATE DATAFRAME
       results <- data.frame(OOB)
       results$filename <- file

       # OUTPUT TO FILE
       file <- sub("\\.csv$", "", file)     # REMOVE .csv EXTENSION
       write.table(results, paste0("rand_oob_reps_", file, ".txt"), sep = "\t")

       # SAVE DF AS NEW ELEMENT IN LIST
       return(results)
})

If you intended to dump one file of all results to .txt, then take new dataframe list from above (OOBresults), run a do.call(rbind, list) and output to file, all outside the loop:

resultsdf <- do.call(rbind, OOBresults)    # ASSUMING SAME STRUCTURED DFs

write.table(resultsdf, "rand_oob_reps.txt", sep = "\t")

Upvotes: 0

Sandipan Dey

Reputation: 23101

First initialize results as NULL outside the for loop, before it starts:

results <- NULL

change the line

results = data.frame(fileNames,OOB)

inside the for loop to

results = rbind(results, data.frame(fileName,OOB))

Move the following line, which writes the OOB results to file, outside the loop so that it runs once after the loop ends:

write.table(results,"rand_oob_reps.txt",sep = "\t")

so the code looks like the following now:

fileNames = list.files(pattern="\\.csv")
results <- NULL

for(fileName in fileNames) {

  sample = read.csv(fileName, header=TRUE, sep=",")
  rf_rand = randomForest(sample[,3:45], sample$Organism, 
                         proximity=TRUE,importance=TRUE, ntree=5000)
  OOB = mean(rf_rand$err.rate[,1])
  results = rbind(results, data.frame(fileName, OOB))

}

write.table(results,"rand_oob_reps.txt",sep = "\t")

results

Also, are you sure that the content of sample is different every time (that is, that the input files are read correctly)?
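One way to check whether the input files really differ is to compare their checksums before running anything expensive. The sketch below is only an illustration: it writes three small made-up CSV files to a temporary directory (two of them deliberately identical) and uses tools::md5sum() to flag duplicates. The file contents and directory are assumptions, not the asker's data.

```r
# Create three small temporary CSV files; two have identical content (illustrative data)
dir <- tempfile("oob_check_")
dir.create(dir)
write.csv(data.frame(x = 1:3), file.path(dir, "sample1.csv"), row.names = FALSE)
write.csv(data.frame(x = 1:3), file.path(dir, "sample2.csv"), row.names = FALSE)
write.csv(data.frame(x = 4:6), file.path(dir, "sample3.csv"), row.names = FALSE)

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# tools::md5sum() returns one checksum per file; equal checksums mean identical bytes
checksums <- tools::md5sum(files)
names(checksums) <- basename(files)

# TRUE marks a file whose content duplicates an earlier file in the list
dup <- duplicated(checksums)
print(data.frame(file = names(checksums), duplicate = dup))
```

If every file shows up as a duplicate of the first, the problem is in the inputs rather than in the loop.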

Upvotes: 2

gambrel

Reputation: 56

Your results = data.frame(fileNames, OOB) line isn't doing what you think it is. fileNames is the full character vector of file names, and OOB is a single value (computed within the loop). So each time the loop runs, the first column is the whole vector of fileNames and the second is the single OOB value for that iteration, recycled to fill every row. Each write.table() call then overwrites the previous version of the file. What you're seeing in the OOB column in the output table is the result from the very last iteration of the loop.
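The recycling can be reproduced without randomForest. This is a minimal sketch using made-up file names and a made-up OOB value, purely for illustration:

```r
# Hypothetical file names and a single made-up OOB value, for illustration only
fileNames <- c("sample1.csv", "sample2.csv", "sample3.csv")
OOB <- 0.0676

# Passing the whole vector: the single OOB value is recycled to every row
recycled <- data.frame(fileNames, OOB)
print(recycled)

# Passing only the current file name gives one row for this iteration
fileName <- fileNames[1]
one_row <- data.frame(fileName, OOB)
print(one_row)
```

The first data frame has three rows that all carry the same OOB value, which matches the symptom in the question; the second has a single row.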

You might try initializing an empty dataframe before the loop:

results = data.frame()

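then, change the part in your loop to add only one row in each iteration (wrapping the values in data.frame() rather than c() keeps OOB numeric instead of coercing it to character):

results <- rbind(results, data.frame(fileName, OOB))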

and finally, move the write.table() call to after the loop runs.
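As an alternative to growing a dataframe with rbind(), the per-file values can be collected with vapply() and the dataframe built once at the end. The sketch below is a swapped-in variant, not the code from the question: compute_oob is a hypothetical stand-in for the randomForest step (it just returns a column mean so the example runs without the package), and the input files are made up.

```r
# Hypothetical stand-in for the random forest step: NOT an OOB error,
# just a per-file number so the sketch runs without randomForest
compute_oob <- function(path) {
  dat <- read.csv(path, header = TRUE)
  mean(dat$x)
}

# Illustrative input files written to a temporary directory
dir <- tempfile("oob_demo_")
dir.create(dir)
for (i in 1:3) {
  write.csv(data.frame(x = i * 1:4),
            file.path(dir, paste0("sample", i, ".csv")),
            row.names = FALSE)
}

fileNames <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# One numeric value per file; vapply() checks that each result is a single number
OOB <- vapply(fileNames, compute_oob, numeric(1), USE.NAMES = FALSE)

# Build the data frame once and write it once, after all files are processed
results <- data.frame(fileName = basename(fileNames), OOB = OOB)
out <- file.path(dir, "rand_oob_reps.txt")
write.table(results, out, sep = "\t", row.names = FALSE, quote = FALSE)
print(results)
```

In real use, the body of compute_oob would hold the read.csv() and randomForest() calls from the question and return the OOB value for that file.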

Upvotes: 0
