M Doster

Reputation: 37

dplyr::summarise() in an R function fails with "argument not numeric or logical" error

I am relatively new to R and am attempting to write my first multi-step function. I want a function that takes a directory, reads the files in it, and picks out a given column (in this case, the pollutant). It should then return the mean of that column, ignoring the NAs. This is what I have so far:

pollutantmean <- function(directory , pollutant , min_id = 1, max_id = 332) {

setwd(directory)

dirdata <- list.files(path=getwd() , pattern='*.csv' , full.names = TRUE) %>% lapply(read_csv) %>% bind_rows

specdata <- dirdata %>% filter(between(ID,min_id,max_id))

polspecdata <- specdata %>% select(pollutant)

polspecdatamean <- polspecdata %>% summarize(mean_pollutant=mean(pollutant,na.rm=TRUE))
} 

I feel that I am close, but the function produces a warning instead of a mean: Warning message: In mean.default(pollutant, na.rm = TRUE) : argument is not numeric or logical: returning NA. I believe the problem is that the column class is col_double, possibly because dirdata is created from multiple csv files. Any help would be greatly appreciated. Thank you!

This is the data: zipfile_data

Upvotes: 0

Views: 1229

Answers (2)

Len Greski

Reputation: 10855

The code in the original post fails because it uses dplyr within a function but does not use dplyr's tools for referring to a column whose name is stored in a variable. Running the code through the RStudio debugger and pausing inside the function shows the problem:

[Screenshot: RStudio debugger paused inside pollutantmean()]

dplyr does not resolve the function argument within mean(pollutant, na.rm = TRUE) to a column, so the summarize() call fails. There is no column named pollutant in the polspecdata data frame, so mean() receives the value of the pollutant argument, which is a text string such as "sulfate", and returns NA with a warning.

One way to fix the error is to adjust the summarize() call to explicitly reference the data frame passed in via the %>% pipe operator. The .data pronoun stands for that data frame, and the [[ form of the extract operator looks up the column by the string held in the argument.

polspecdatamean <- polspecdata %>% summarize(mean_pollutant=mean(.data[[pollutant]],na.rm=TRUE))
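To see why the plain argument fails while the .data version works, here is a minimal sketch on a made-up three-row tibble. The toy data frame and its values are assumptions for illustration, not the course data, and the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical stand-in for the filtered monitor data
toy <- tibble(ID = c(1, 1, 2), sulfate = c(2.0, NA, 4.0))
pollutant <- "sulfate"

# No column is called `pollutant`, so mean() receives the string "sulfate",
# warns, and returns NA (the warning is suppressed here to keep output clean)
bad <- suppressWarnings(
  toy %>% summarize(mean_pollutant = mean(pollutant, na.rm = TRUE))
)

# .data[[pollutant]] looks up the column named by the string
good <- toy %>% summarize(mean_pollutant = mean(.data[[pollutant]], na.rm = TRUE))

bad$mean_pollutant   # NA
good$mean_pollutant  # 3
```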

Finally, since the function should return the mean to the caller, we add the object created by the summarize() call as the last expression of the function. Without it, the function ends with an assignment and returns its value invisibly.

polspecdatamean
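The extra line matters because an R function whose last expression is an assignment still returns the value, but invisibly, so nothing prints at the console. A small base R sketch of the difference follows; both helper functions are hypothetical and exist only for this illustration.

```r
# Hypothetical helpers that differ only in their last expression
ends_with_assignment <- function(x) {
  result <- mean(x, na.rm = TRUE)
}

ends_with_value <- function(x) {
  result <- mean(x, na.rm = TRUE)
  result
}

ends_with_assignment(c(1, NA, 3))  # nothing is printed at the console
ends_with_value(c(1, NA, 3))       # prints [1] 2

# withVisible() reports whether a result would be auto-printed
withVisible(ends_with_assignment(c(1, NA, 3)))$visible  # FALSE
withVisible(ends_with_value(c(1, NA, 3)))$visible       # TRUE
```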

Since this is a programming assignment for the Johns Hopkins University R Programming course on Coursera, I won't post a complete answer because that violates the Coursera Honor Code.

Simplifying the solution

Once the data has been filtered into specdata, the function could simply return the mean as follows.

mean(specdata[[pollutant]],na.rm=TRUE)
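As a rough illustration of why [[ works with a string while $ would not, here is a base R sketch on invented data. The specdata_toy data frame and its values are assumptions, not the assignment files.

```r
# Hypothetical stand-in for the filtered data; base R only
specdata_toy <- data.frame(ID = c(1, 1, 2),
                           sulfate = c(2.0, NA, 4.0),
                           nitrate = c(0.5, 1.5, NA))
pollutant <- "nitrate"

# [[ accepts a string, so the column is chosen at run time
nitrate_mean <- mean(specdata_toy[[pollutant]], na.rm = TRUE)
nitrate_mean  # 1

# $ does not evaluate its right-hand side: this looks for a column
# literally named "pollutant", finds none, and returns NULL
is.null(specdata_toy$pollutant)  # TRUE
```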

Conclusions

For this particular assignment, dplyr makes the task more difficult than it needs to be: dplyr uses non-standard evaluation, and it isn't covered in the JHU curriculum until the third course in the sequence.

The code has some other subtle defects whose correction we'll leave as an exercise for the reader. For example, given the assignment requirements, the function should be able to handle the following inputs:

pollutantmean("specdata","sulfate",23) # calc mean for sensor 23
pollutantmean("specdata","nitrate",70:72) # calc mean for sensors 70 - 72 
pollutantmean("specdata","sulfate",c(3,5,7,9)) # calc mean for sensors 3, 5, 7, and 9 

Upvotes: 1

Ronak Shah

Reputation: 388907

Assuming you are passing the pollutant variable as a string, try the function below.

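library(tidyverse)

pollutantmean <- function(directory, pollutant, min_id = 1, max_id = 332) {
  # Read every csv file in the directory into one data frame
  # (pattern is a regular expression, not a glob)
  dirdata <- list.files(path = directory, pattern = '\\.csv$', full.names = TRUE) %>%
    map_df(read_csv)

  dirdata %>%
    filter(between(ID, min_id, max_id)) %>%
    # !!sym() turns the string in pollutant into a column reference
    summarise(mean_pollutant = mean(!!sym(pollutant), na.rm = TRUE))
}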

So you can call it as

pollutantmean('/path', 'sulfate', 1, 10)

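sym() converts the string held in pollutant (here "sulfate") into a symbol, and !! unquotes it, so summarise() evaluates sulfate as a column and not as a string.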
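The effect of !!sym can be checked on a small made-up tibble, without reading any files. This is a sketch under assumptions: the toy data and its values are invented, and rlang is loaded explicitly so that sym() is available regardless of which other packages are attached. The .data pronoun is included for comparison.

```r
library(dplyr)
library(rlang)  # provides sym()

# Hypothetical data, not the assignment files
toy <- tibble(ID = 1:3, sulfate = c(1, NA, 5))
pollutant <- "sulfate"

# sym() builds a symbol from the string; !! splices it into the expression
via_sym <- toy %>%
  summarise(mean_pollutant = mean(!!sym(pollutant), na.rm = TRUE))

# The .data pronoun reaches the same column by name
via_pronoun <- toy %>%
  summarise(mean_pollutant = mean(.data[[pollutant]], na.rm = TRUE))

via_sym$mean_pollutant      # 3
via_pronoun$mean_pollutant  # 3
```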

Upvotes: 1
