Reputation: 37
I am relatively new to R and am attempting to write my first multi-step function. Essentially, I want a function that takes a directory, reads the files in that directory, and finds a certain column (in this case, pollutant). It should then return the mean value of that column with the NAs removed. This is what I have so far:
pollutantmean <- function(directory , pollutant , min_id = 1, max_id = 332) {
  setwd(directory)
  dirdata <- list.files(path=getwd() , pattern='*.csv' , full.names = TRUE) %>% lapply(read_csv) %>% bind_rows
  specdata <- dirdata %>% filter(between(ID,min_id,max_id))
  polspecdata <- specdata %>% select(pollutant)
  polspecdatamean <- polspecdata %>% summarize(mean_pollutant=mean(pollutant,na.rm=TRUE))
}
I feel that I am so close, but the result is a warning: Warning message: In mean.default(pollutant, na.rm = TRUE) : argument is not numeric or logical: returning NA. I believe the problem is due to the column class being col_double. This may be because dirdata is created from multiple csv files. Any help would be greatly appreciated. Thank you!
This is the data: zipfile_data
Upvotes: 0
Views: 1229
Reputation: 10855
The code in the original post fails because it uses dplyr within a function, but does not use dplyr quoting functions. When we run the code through the RStudio debugger and stop at line 7, we see the following:
dplyr does not render the function argument within mean(pollutant, na.rm = TRUE) as expected, so line 9 fails. The mean() function fails because the pollutant argument renders as a text string, not a column in the polspecdata data frame.
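The failure can be reproduced without reading any files. This is a minimal sketch using only base R; the variable name mirrors the function argument in the question, and the value "sulfate" is just an example.

```r
# Inside the function, `pollutant` holds a character string such as "sulfate".
pollutant <- "sulfate"

# mean() cannot average a character vector: it warns
# "argument is not numeric or logical: returning NA" and returns NA.
# suppressWarnings() is used here only to keep the output quiet.
result <- suppressWarnings(mean(pollutant, na.rm = TRUE))
is.na(result)  # TRUE
```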
One way to fix the error is to adjust line 9 to explicitly reference the data frame passed from the prior function via the %>% pipe operator. Inside summarize(), the .data pronoun stands for that data frame, and the [[ form of the extract operator looks up the column from the string version of the argument.
polspecdatamean <- polspecdata %>% summarize(mean_pollutant=mean(.data[[pollutant]],na.rm=TRUE))
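Finally, since the function should return the mean to the parent environment, we add a print of the object created in line 9 at the end of the function. Without it, the last expression in the function is an assignment, whose value is returned invisibly and therefore does not print.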
polspecdatamean
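To see the fix in isolation, here is a small sketch on in-memory data. The toy tibble and the helper name mean_of_column are hypothetical and are not part of the assignment; only the summarize() call follows the corrected line above.

```r
library(dplyr)

# Hypothetical toy data standing in for the combined CSV files.
toy <- tibble(
  ID      = c(1, 1, 2, 2),
  sulfate = c(2, NA, 4, 6),
  nitrate = c(1, 3, NA, 5)
)

# Hypothetical helper: `pollutant` is a column name passed as a string.
mean_of_column <- function(data, pollutant) {
  result <- data %>%
    summarize(mean_pollutant = mean(.data[[pollutant]], na.rm = TRUE))
  result  # a bare object name as the last expression is the visible return value
}

mean_of_column(toy, "sulfate")  # one-row tibble; mean of 2, 4 and 6
```

Because the column name arrives as a string, the same helper works for "nitrate" without any change to the function body.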
Since this is a programming assignment for the Johns Hopkins University R Programming course on Coursera, I won't post a complete answer because that violates the Coursera Honor Code.
Once the data has been filtered in line 5, the function could simply return the mean as follows.
mean(specdata[[pollutant]],na.rm=TRUE)
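The base R form works because [[ accepts a string, whereas $ does not evaluate its argument. A short sketch on a hypothetical toy data frame (not the assignment data) shows the difference:

```r
# Hypothetical toy data frame standing in for the filtered data.
specdata <- data.frame(
  ID      = c(1, 2, 3),
  sulfate = c(1.5, NA, 2.5)
)
pollutant <- "sulfate"

# `[[` looks the column up by the string stored in `pollutant`.
col_mean <- mean(specdata[[pollutant]], na.rm = TRUE)
col_mean  # 2

# `$` looks for a column literally named "pollutant", which does not exist.
is.null(specdata$pollutant)  # TRUE
```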
For this particular assignment, use of dplyr makes the assignment more difficult than it needs to be, because dplyr uses non-standard evaluation and isn't even covered in the JHU curriculum until the third course in the sequence.
The code has some other subtle defects whose correction we'll leave as an exercise for the reader. For example, given the assignment requirements, the function should be able to handle the following inputs:
pollutantmean("specdata","sulfate",23) # calc mean for sensor 23
pollutantmean("specdata","nitrate",70:72) # calc mean for sensors 70 - 72
pollutantmean("specdata","sulfate",c(3,5,7,9)) # calc mean for sensors 3, 5, 7, and 9
Upvotes: 1
Reputation: 388907
Assuming you are passing the pollutant variable as a string, try using the function below.
library(tidyverse)
pollutantmean <- function(directory , pollutant , min_id = 1, max_id = 332) {
  # pattern is a regular expression, not a glob: match names ending in ".csv"
  dirdata <- list.files(path=directory, pattern='\\.csv$' , full.names = TRUE) %>%
    map_df(read_csv)
  dirdata %>%
    filter(between(ID,min_id,max_id)) %>%
    summarise(mean_pollutant= mean(!!sym(pollutant),na.rm=TRUE))
}
So you can call it as
pollutantmean('/path', 'sulfate', 1, 10)
Using !!sym, we evaluate sulfate as a column and not as a string.
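As a rough illustration on toy data, sym() turns the string into a symbol and !! splices that symbol into the expression, so summarise() sees a column reference. The tibble and variable names below are hypothetical; rlang::sym() is called with its namespace so the sketch does not depend on which packages are attached. The result is compared with the .data[[ ]] form, which should give the same number.

```r
library(dplyr)

# Hypothetical toy data; not the assignment files.
toy <- tibble(sulfate = c(1, NA, 3))
col <- "sulfate"

# sym() converts the string to a symbol; !! unquotes it inside summarise().
via_sym <- toy %>%
  summarise(m = mean(!!rlang::sym(col), na.rm = TRUE))

# The .data pronoun performs the same lookup by string.
via_data <- toy %>%
  summarise(m = mean(.data[[col]], na.rm = TRUE))

identical(via_sym$m, via_data$m)  # TRUE
```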
Upvotes: 1