Reputation: 869
I am working in R and learning how to code. I have written a piece of code using a for loop, and I find it very slow. I was wondering if I could get some assistance converting it to use either sapply or lapply. Here is my working R code:
library(dplyr)

pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)  # creates a list of files
  dat <- data.frame()                                     # creates an empty data frame
  for (i in seq_along(files_list)) {
    # loops through the files, rbinding them together
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  dat_subset <- filter(dat, dat$ID %in% id)    # subsets the rows that match the 'id' argument
  mean(dat_subset[, pollutant], na.rm = TRUE)  # identifies the mean of a pollutant
}
pollutantmean("specdata", "sulfate", 1:10)
This code takes almost 20 seconds to return, which is unacceptable for 332 files. Imagine if I had a dataset with 10K files and wanted to get the mean of those variables.
Upvotes: 0
Views: 69
Reputation: 388807
The reason your code is slow is that you are incrementally growing your data frame inside the loop. One way to do this using dplyr and map_df from purrr is:
library(dplyr)

pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  purrr::map_df(files_list, read.csv) %>%
    filter(ID %in% id) %>%
    summarise_at(pollutant, mean, na.rm = TRUE)
}
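The call from the question works unchanged with this version. Note that summarise_at() returns a one-row data frame named after the pollutant column rather than a bare number; if you want a plain numeric value like the original mean() call, one option (a small sketch, not something the function above requires) is to pipe through pull():

pollutantmean("specdata", "sulfate", 1:10)            # returns a 1x1 data frame
pollutantmean("specdata", "sulfate", 1:10) %>% pull()  # returns a single number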
Upvotes: 0
Reputation: 1378
You can rbind all elements of a list using do.call, and you can read all the files into that list using lapply:
mean(
  filter(               # here's the filter that will be applied to the rbind-ed data
    do.call("rbind",    # call "rbind" on all elements of a list
            lapply(     # create a list by reading in the files from list.files()
              # add any necessary args to read.csv:
              list.files("[::DIR_PATH::]", full.names = TRUE), function(x) read.csv(file = x, ...)
            )
    ),
    ID %in% id          # make sure id is replaced with what you want
  )[[pollutant]],       # pollutant holds a column name, so use [[ ]] rather than $
  na.rm = TRUE
)
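Wrapped back into the function from the question, a minimal sketch of the same idea (assuming dplyr is loaded for filter() and the same specdata layout as in the question) could look like:

library(dplyr)

pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  dat <- do.call("rbind", lapply(files_list, read.csv))  # read every file, then bind once
  dat_subset <- filter(dat, ID %in% id)
  mean(dat_subset[[pollutant]], na.rm = TRUE)
}

pollutantmean("specdata", "sulfate", 1:10)

Building the full list first and calling rbind once avoids the repeated copying that makes the original loop slow.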
Upvotes: 1