Reputation: 197
I am trying to use dplyr's summarize function to calculate summary statistics inside a two-argument function that takes a table and a field name from a connected database. Unfortunately, as soon as I wrap summarize in another function, the results are no longer correct: the resulting data frame repeats the same values for every group instead of computing them per group. The input and output are shown below:
Summary Statistics Function
library(dplyr)
data <- iris
data <- group_by(.data = data, Species)
SummaryStatistics <- function(table, field){
  table %>%
    summarise(count = n(),
              min = min(table[[field]], na.rm = T),
              mean = mean(table[[field]], na.rm = T, trim = 0.05),
              median = median(table[[field]], na.rm = T))
}
SummaryStatistics(data, "Sepal.Length")
Output Table--Incorrect, it's just repeating the same calculation
Species count min mean median
1 setosa 50 4.3 5.820588 5.8
2 versicolor 50 4.3 5.820588 5.8
3 virginica 50 4.3 5.820588 5.8
Correct Table/Desired Outcome--This is what the table should look like. When I run the summarize function outside of the wrapper function, this is what it produces.
Species count min mean median
1 setosa 50 4.3 5.002174 5.0
2 versicolor 50 4.9 5.934783 5.9
3 virginica 50 4.9 6.593478 6.5
I hope this is easy to understand. I just can't grasp why the summary statistics work perfectly outside of the wrapper function, but as soon as I pass arguments to it, it calculates the same thing for each row. Any help would be greatly appreciated.
Thanks, Kev
Upvotes: 4
Views: 5609
Reputation: 8072
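dplyr verbs use Non-Standard Evaluation (NSE), so to use them programmatically you need their standard-evaluation counterparts (here summarise_) alongside lazyeval. In your wrapper, table[[field]] extracts the entire column from the table and ignores the grouping, which is why every group gets the same result. The dplyr NSE vignette covers it fairly well.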
library(dplyr)
library(lazyeval)
data <- group_by(iris, Species)
SummaryStatistics <- function(table, field){
  table %>%
    summarise_(count = ~n(),
               min = interp(~min(var, na.rm = T), var = as.name(field)),
               mean = interp(~mean(var, na.rm = T, trim = 0.05), var = as.name(field)),
               median = interp(~median(var, na.rm = T), var = as.name(field)))
}
SummaryStatistics(data, "Sepal.Length")
# A tibble: 3 × 5
Species count min mean median
<fctr> <int> <dbl> <dbl> <dbl>
1 setosa 50 4.3 5.002174 5.0
2 versicolor 50 4.9 5.934783 5.9
3 virginica 50 4.9 6.593478 6.5
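A note for readers on newer dplyr releases: the underscore verbs such as summarise_ were deprecated in dplyr 0.7, and lazyeval is no longer the recommended tool. Below is a minimal sketch of the same wrapper using the .data pronoun instead. It assumes a recent dplyr and that field is a column name passed as a string. It is written against the local iris data frame only, so how it behaves on a database-backed table is an assumption that would need checking.

```r
library(dplyr)

# Sketch only: same wrapper as above, rewritten with the .data pronoun.
# Assumes `field` is a single column name supplied as a character string.
SummaryStatistics <- function(table, field) {
  table %>%
    summarise(
      count  = n(),
      # .data[[field]] refers to the column within the current group,
      # unlike table[[field]], which always returns the full column.
      min    = min(.data[[field]], na.rm = TRUE),
      mean   = mean(.data[[field]], trim = 0.05, na.rm = TRUE),
      median = median(.data[[field]], na.rm = TRUE)
    )
}

result <- iris %>%
  group_by(Species) %>%
  SummaryStatistics("Sepal.Length")

print(result)
```

On the grouped iris data this should give one row per species with per-group statistics, matching the desired table in the question.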
Upvotes: 13