Calculating medians via dplyr vs. aggregate in R

Question

Hello: I am getting slightly different medians for a data set like the one created below, depending on whether I compute them via dplyr/tidyr or via aggregate. Can anyone explain the difference? Thank you!

#dataset
out2<-structure(list(d3 = structure(c(1L, 2L, NA, NA, 1L, 1L, NA,  
2L,NA,3L,1L, NA, NA, 1L, 3L, NA, 1L, 2L, 3L, 2L, 1L, 3L, 2L, 3L, 1L), .Label 
=     c("Professional journalist", "Elected politician", "Online blogger"), 
class = "factor"), Accessible = c(3, 5, 2,NA, 1, 2, NA, 3, NA, 4, 2, 5, NA, 
3, 4, NA, 2, NA, 3, 4, 4, 4,2, 2, 2), Information = c(1, 2, 1, NA, 4, 1, NA, 
2, NA, 2, 1, 1, NA, 4, 1, NA, 1, 1, 1, 3, 1, 3, 3, 4, 1), Responsive = c(5, 
4, 6, NA, 2, 3, NA, 1, NA, 5, 4, 4, NA, 6, 3, NA, 4, NA, 2, 2, 6, 2, 1, 1, 
3), Debate = c(6, 3, 4, NA, 3, 4, NA, 5, NA, 6, 5,6, NA, 1, 5, NA, 5, 2, NA,
1, 5, 6, 5, 5, 7), Officials = c(2,1, 5, NA, 5, 5, NA, 6, NA, 3, 6, 2, NA, 2,
2, NA, 6, 3, NA, 5,2, 5, 4, 6, 5), Social = c(7, 6, 7, NA, 7, 7, NA, 4, NA,
7, 7,
7, NA, 7, 7, NA, 7, NA, NA, 7, 7, 1, 6, 7, 6), `Trade-Offs` = c(4, 
7, 3, NA, 6, 6, NA, 7, NA, 1, 3, 3, NA, 5, 6, NA, 3, NA, NA,
6, 3, 7, 7, 3, 4)), .Names = c("d3", "Accessible", "Information",    
"Responsive", "Debate", "Officials", "Social", "Trade-Offs"), row.names = 
c(171L, 126L, 742L, 379L, 635L, 3L, 303L, 419L, 324L, 97L, 758L, 136L, 
770L, 405L, 101L, 674L, 386L, 631L, 168L, 590L, 731L, 387L, 673L, 208L, 
728L), class = "data.frame")

#Find medians via tidyr and dplyr
library(dplyr)
library(tidyr)
library(ggplot2) #needed for the plots

test<-out2 %>%
  gather(variable, value, -1) %>%
  filter(!is.na(d3)) %>%
  group_by(d3, variable) %>%
  summarise(value=median(value, na.rm=TRUE))

#dataframe
test<-data.frame(test)

#find Medians via aggregate
test2<-aggregate(.~d3, data=out2, FUN=median, na.rm=TRUE)

#Gather for plotting
test2<-test2 %>% 
gather(variable, value, -d3)

#Plot Medians via tidyr
ggplot(test, aes(x=d3, y=value, group=d3))+
  facet_wrap(~variable)+
  geom_bar(stat='identity')+
  labs(title='Medians via TidyR')

#Plot Medians Via aggregate
ggplot(test2, aes(x=d3, y=value, group=d3))+
  facet_wrap(~variable)+
  geom_bar(stat='identity')+
  labs(title='Medians via Aggregate')

#Compare Debate, Information and Responsive

Sam Firke · Accepted Answer

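The results produced by aggregate are different because its formula interface defaults to na.action = na.omit, which drops every row where any value is NA before the medians are computed, even if other variables in that row contain data. The dplyr/tidyr pipeline, by contrast, removes NAs one variable at a time, so it uses more of the data.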
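
To make the row-dropping concrete, here is a minimal sketch on a toy data frame. The names toy, g, x and y are hypothetical and not part of the question's data; the example only assumes base R.

```r
# Toy data (hypothetical): two groups, with NAs only in column y
toy <- data.frame(
  g = c("a", "a", "a", "b", "b"),
  x = c(1, 2, 10, 4, 6),
  y = c(5, NA, 7, 8, NA)
)

# Rows that the default na.action = na.omit discards (rows 2 and 5)
print(toy[!complete.cases(toy), ])

# Formula interface with defaults: x is summarised on complete rows only
agg <- aggregate(. ~ g, data = toy, FUN = median, na.rm = TRUE)
print(agg$x)    # 5.5 4: the x values 2 and 6 were thrown away with their rows

# Median of x per group using every non-missing x value
per_col <- tapply(toy$x, toy$g, median, na.rm = TRUE)
print(per_col)  # a = 2, b = 5
```

Column x has no missing values at all, yet its medians change, because rows are removed based on the NAs in y.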

You can correct this by specifying a value for the na.action argument, as described in this accepted answer. Here it would be:

test2<-aggregate(.~d3, data=out2, FUN=median, na.rm = TRUE, na.action=NULL)
test2<-test2 %>% 
  gather(variable, value, -d3)

Confirm that the results are the same:

identical(as.data.frame(test %>% arrange(d3, variable, value)),
          as.data.frame(test2 %>% arrange(d3, variable, value)))
[1] TRUE
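
As a side note, and not part of the answer above, the same effect can be sketched in two other ways in base R: passing na.action = na.pass to the formula interface, or using the by = interface of aggregate, which has no na.action step. The toy data and its names (toy, g, x, y) are hypothetical.

```r
# Toy data (hypothetical): two groups, with NAs only in column y
toy <- data.frame(
  g = c("a", "a", "a", "b", "b"),
  x = c(1, 2, 10, 4, 6),
  y = c(5, NA, 7, 8, NA)
)

# Formula interface, keeping all rows; na.rm = TRUE is forwarded to median()
by_form <- aggregate(. ~ g, data = toy, FUN = median,
                     na.rm = TRUE, na.action = na.pass)

# by = interface: no rows are dropped for NAs in the summarised columns
by_list <- aggregate(toy[c("x", "y")], by = list(g = toy$g),
                     FUN = median, na.rm = TRUE)

print(by_form)
print(by_list)
```

Both calls compute each column's median from all of its non-missing values, which is what the dplyr/tidyr pipeline does.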
