Reputation: 23
I'm trying to use a for loop to run random forest on multiple input files in sequence, calculate the OOB error of the resulting rf object for each of those files (based on 5000 trees), and output those OOB error values into one results file. The results file is returning the exact same OOB values for each file, which is not correct (i.e., it's returning the OOB value for only one of my input files). I've tried the following:
fileNames = list.files(pattern="\\.csv")
for(fileName in fileNames){
  sample = read.csv(fileName, header=TRUE, sep=",")
  rf_rand = randomForest(
    sample[,3:45],
    sample$Organism,
    proximity=TRUE,
    importance=TRUE,
    ntree=5000)
  OOB = mean(rf_rand$err.rate[,1])
  results = data.frame(fileNames,OOB)
  write.table(results,"rand_oob_reps.txt",sep = "\t")
}
results
#1 sample1.csv 0.06764769
#2 sample2.csv 0.06764769
#3 sample3.csv 0.06764769
I have also tried unsuccessfully with:
for(i in 1:length(fileNames))
This seems like a simple issue, but so far my search for answers has come up empty. Thanks for any insights.
Upvotes: 2
Views: 375
Reputation: 107567
You are writing to the same file on every iteration, so each pass overwrites the previous output. Also, when creating the data frame you pass the entire vector fileNames rather than the individual file name: results = data.frame(fileNames,OOB)
Consider the lapply() solution below, which 1) adds the current file name as the last column of the data frame, 2) saves one .txt file per iteration, suffixed with the original file name, and 3) returns a list of the results data frames from all iterations:
library(randomForest)

fileNames <- list.files(pattern="\\.csv$")
OOBresults <- lapply(fileNames, function(file) {
  # READ IN FILE
  sample <- read.csv(file, header=TRUE, sep=",")
  # CALCULATE RF RESULTS
  rf_rand <- randomForest(sample[,3:45], sample$Organism,
                          proximity=TRUE, importance=TRUE, ntree=5000)
  OOB <- mean(rf_rand$err.rate[,1])
  # CREATE DATAFRAME
  results <- data.frame(OOB)
  results$filename <- file
  # OUTPUT TO FILE
  file <- sub("\\.csv$", "", file)  # REMOVE .csv EXTENSION
  write.table(results, paste0("rand_oob_reps_", file, ".txt"), sep = "\t")
  # SAVE DF AS NEW ELEMENT IN LIST
  return(results)
})
If you intended to dump all results into one .txt file, take the list of data frames from above (OOBresults), run do.call(rbind, list) on it, and write the output to file, all outside the loop:
resultsdf <- do.call(rbind, OOBresults) # ASSUMING SAME STRUCTURED DFs
write.table(resultsdf, "rand_oob_reps.txt", sep = "\t")
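As a minimal sketch of that combining step, the example below uses a hand-made list in place of the real lapply() output. The OOB values, the file names, and the temporary output path are all made up for illustration; only do.call(rbind, ...) and write.table() come from the answer itself.

```r
# Hypothetical per-file results, standing in for the list returned by lapply()
OOBresults <- list(
  data.frame(OOB = 0.07, filename = "sample1.csv"),
  data.frame(OOB = 0.11, filename = "sample2.csv"),
  data.frame(OOB = 0.05, filename = "sample3.csv")
)

# Stack the one-row data frames into a single data frame (one row per file)
resultsdf <- do.call(rbind, OOBresults)

# Write once, after all results are collected (a temp file is used here
# only so the sketch does not touch the working directory)
out_file <- tempfile(fileext = ".txt")
write.table(resultsdf, out_file, sep = "\t", row.names = FALSE, quote = FALSE)

print(resultsdf)
```

This only works cleanly when every data frame in the list has the same column names, which is what the "ASSUMING SAME STRUCTURED DFs" comment refers to.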
Upvotes: 0
Reputation: 23101
First initialize an empty (NULL) results object before the for loop:
results <- NULL
change the line
results = data.frame(fileNames,OOB)
inside the for loop to
results = rbind(results, data.frame(fileName,OOB))
Then move the following line, which writes the OOB results to file, to after the loop ends:
write.table(results,"rand_oob_reps.txt",sep = "\t")
So the code now looks like this:
library(randomForest)

fileNames = list.files(pattern="\\.csv$")
results <- NULL
for(fileName in fileNames) {
  sample = read.csv(fileName, header=TRUE, sep=",")
  rf_rand = randomForest(sample[,3:45], sample$Organism,
                         proximity=TRUE, importance=TRUE, ntree=5000)
  OOB = mean(rf_rand$err.rate[,1])
  # append one row (current file name and its OOB) per iteration
  results = rbind(results, data.frame(fileName, OOB))
}
# write once, after all files have been processed
write.table(results,"rand_oob_reps.txt",sep = "\t")
results
Are you sure that the content of sample is different every time (i.e., that the input files are being read correctly)?
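One possible way to check that question is to compare checksums of the input files, as in the rough sketch below. It builds three small temporary CSVs (two deliberately identical) rather than using the real data; the directory, file contents, and variable names are invented for illustration. tools::md5sum() ships with base R.

```r
# Write three small temporary CSVs; sample1 and sample2 are deliberately identical
tmp_dir <- tempfile("oob_check_")
dir.create(tmp_dir)
write.csv(data.frame(x = 1:3), file.path(tmp_dir, "sample1.csv"), row.names = FALSE)
write.csv(data.frame(x = 1:3), file.path(tmp_dir, "sample2.csv"), row.names = FALSE)
write.csv(data.frame(x = 4:6), file.path(tmp_dir, "sample3.csv"), row.names = FALSE)

# Same pattern as in the loop, but with full paths so the files can be read
csv_paths <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# One MD5 checksum per file; equal checksums mean byte-identical files
checksums <- tools::md5sum(csv_paths)

# Flag any file whose content repeats an earlier file
is_dup <- duplicated(checksums)
print(basename(csv_paths)[is_dup])
```

If no file is flagged on the real inputs, identical OOB values across files are more likely caused by how the results are assembled than by the data.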
Upvotes: 2
Reputation: 56
Your results = data.frame(fileNames, OOB) line isn't doing what you think it is. fileNames is a character vector of all the file names, and OOB is a single value (computed within the loop). So each time the loop runs, the data frame has the full set of fileNames in its first column and, in its second, the single OOB value from that iteration, recycled to fill every row. Saving it with write.table() then overwrites the previous version of the file. What you're seeing in the OOB column of the output table is the result from the very last iteration of the loop.
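The recycling described above can be reproduced without randomForest at all. The sketch below uses made-up file names and a made-up OOB value purely for illustration.

```r
# Hypothetical file names and a single OOB value, for illustration only
fileNames <- c("sample1.csv", "sample2.csv", "sample3.csv")
OOB <- 0.0676

# The length-1 OOB is recycled to match the three file names,
# so every row ends up with the same value
recycled <- data.frame(fileNames, OOB)
print(recycled)

# Pairing ONE file name with ONE value gives a single row instead
fileName <- fileNames[1]
single <- data.frame(fileName, OOB)
print(nrow(single))
```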
You might try initializing an empty dataframe before the loop:
results = data.frame()
Then change the line in your loop so that it adds only one row per iteration (using data.frame() rather than c(), which would coerce OOB to character):
results <- rbind(results, data.frame(fileName, OOB))
Finally, move the write.table() call to after the loop.
Upvotes: 0