Reputation: 39
I am new to R. I created the function below to calculate the mean of a pollutant across the datasets contained in 332 CSV files. I'm seeking advice on how I could improve this code. It takes about 38 seconds to run, which makes me think it is not very efficient.
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)  # create list of files
  dat <- data.frame()                                     # create empty data frame
  for (i in id) {
    dat <- rbind(dat, read.csv(files_list[i]))            # combine all the monitor data together
  }
  good <- complete.cases(dat)                             # flag rows with no NA values
  mean(dat[good, pollutant])                              # calculate mean
} # run time ~ 37 sec - NEED TO OPTIMISE THE CODE
Upvotes: 3
Views: 168
Reputation: 31171
Instead of creating an empty data.frame and calling rbind on it at every iteration of a for loop, you can store all the data.frames in a list and combine them in one shot. You can also use the na.rm argument of mean so that NA values are not taken into account.
pollutantmean <- function(directory, pollutant, id = 1:332)
{
  files_list = list.files(directory, full.names = TRUE)[id]  # keep only the requested files
  df = do.call(rbind, lapply(files_list, read.csv))          # read all files, then bind once
  mean(df[[pollutant]], na.rm = TRUE)                        # ignore NA values
}
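Binding once avoids copying a growing data frame on every iteration. For reference, a call might look like the sketch below; "specdata" and "sulfate" are placeholder directory and column names, not taken from the question.

# Hypothetical usage; "specdata" and "sulfate" are assumed names for
# the data directory and the pollutant column.
pollutantmean("specdata", "sulfate", 1:10)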
Optional - I would increase the readability with magrittr pipes (extract2 is magrittr's alias for [[):
library(magrittr)

pollutantmean <- function(directory, pollutant, id = 1:332)
{
  list.files(directory, full.names = TRUE)[id] %>%
    lapply(read.csv) %>%
    do.call(rbind, .) %>%
    extract2(pollutant) %>%
    mean(na.rm = TRUE)
}
Upvotes: 4
Reputation: 21497
You can improve it by using data.table's fread function (see Quickly reading very large tables as dataframes in R). Also, binding the results using data.table::rbindlist is way faster.
require(data.table)

pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list = list.files(directory, full.names = TRUE)[id]  # keep only the requested files
  DT = rbindlist(lapply(files_list, fread))                  # fast read + fast bind
  mean(DT[[pollutant]], na.rm = TRUE)
}
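If you want to check how much each variant actually saves, a minimal timing sketch is below; it assumes one of the pollutantmean definitions above is loaded and that "specdata" is the directory holding the 332 CSV files (both names are assumptions, adjust to your setup).

# Minimal timing sketch; "specdata" and "sulfate" are assumed names.
system.time(pollutantmean("specdata", "sulfate", 1:332))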
Upvotes: 1