spindoctor

Reputation: 1895

Calculating medians via dplyr vs. aggregate in R

Hello: I am getting slightly different medians for a data set like the one created below, depending on whether I compute them with dplyr/tidyr or with aggregate. Can anyone explain the difference? Thank you!

#packages used below
library(dplyr)
library(tidyr)
library(ggplot2)

#dataset
out2<-structure(list(d3 = structure(c(1L, 2L, NA, NA, 1L, 1L, NA,  
2L,NA,3L,1L, NA, NA, 1L, 3L, NA, 1L, 2L, 3L, 2L, 1L, 3L, 2L, 3L, 1L), .Label 
=     c("Professional journalist", "Elected politician", "Online blogger"), 
class = "factor"), Accessible = c(3, 5, 2,NA, 1, 2, NA, 3, NA, 4, 2, 5, NA, 
3, 4, NA, 2, NA, 3, 4, 4, 4,2, 2, 2), Information = c(1, 2, 1, NA, 4, 1, NA, 
2, NA, 2, 1, 1, NA, 4, 1, NA, 1, 1, 1, 3, 1, 3, 3, 4, 1), Responsive = c(5, 
4, 6, NA, 2, 3, NA, 1, NA, 5, 4, 4, NA, 6, 3, NA, 4, NA, 2, 2, 6, 2, 1, 1, 
3), Debate = c(6, 3, 4, NA, 3, 4, NA, 5, NA, 6, 5,6, NA, 1, 5, NA, 5, 2, NA,
1, 5, 6, 5, 5, 7), Officials = c(2,1, 5, NA, 5, 5, NA, 6, NA, 3, 6, 2, NA, 2,
2, NA, 6, 3, NA, 5,2, 5, 4, 6, 5), Social = c(7, 6, 7, NA, 7, 7, NA, 4, NA,
7, 7, 7, NA, 7, 7, NA, 7, NA, NA, 7, 7, 1, 6, 7, 6), `Trade-Offs` = c(4, 
7, 3, NA, 6, 6, NA, 7, NA, 1, 3, 3, NA, 5, 6, NA, 3, NA, NA,
6, 3, 7, 7, 3, 4)), .Names = c("d3", "Accessible", "Information",    
"Responsive", "Debate", "Officials", "Social", "Trade-Offs"), row.names = 
c(171L, 126L, 742L, 379L, 635L, 3L, 303L, 419L, 324L, 97L, 758L, 136L, 
770L, 405L, 101L, 674L, 386L, 631L, 168L, 590L, 731L, 387L, 673L, 208L, 
728L), class = "data.frame")

#Find medians via tidyr and dplyr
test <- out2 %>%
  gather(variable, value, -1) %>%
  filter(!is.na(d3)) %>%
  group_by(d3, variable) %>%
  summarise(value = median(value, na.rm = TRUE))

#dataframe
test<-data.frame(test)

#find Medians via aggregate
test2<-aggregate(.~d3, data=out2, FUN=median, na.rm=TRUE)

#Gather for plotting
test2<-test2 %>% 
gather(variable, value, -d3)

#Plot medians via tidyr
ggplot(test, aes(x = d3, y = value, group = d3)) +
  facet_wrap(~variable) +
  geom_bar(stat = 'identity') +
  labs(title = 'Medians via TidyR')

#Plot medians via aggregate
ggplot(test2, aes(x = d3, y = value, group = d3)) +
  facet_wrap(~variable) +
  geom_bar(stat = 'identity') +
  labs(title = 'Medians via Aggregate')

#Compare Debate, Information and Responsive

Upvotes: 1

Views: 335

Answers (1)

Sam Firke

Reputation: 23014

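The results produced by aggregate are different because its formula method defaults to na.action = na.omit, which drops every row where any value is NA before FUN is called, even if other variables in that row contain data. The na.rm = TRUE argument never sees those rows, so their non-missing values are excluded from the medians as well.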
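To see the row-dropping in isolation, here is a minimal sketch on a made-up data frame. The names toy, g, x, y, agg and x_by_group are hypothetical and not part of the question's data; the comparison uses base R's tapply as a stand-in for a per-column median.

```r
# Hypothetical toy data (not the asker's out2): row 1 has x but is missing y
toy <- data.frame(
  g = c("a", "a", "a", "b", "b"),
  x = c(1, 2, 9, 4, 6),
  y = c(NA, 5, 7, 1, 3)
)

# Formula method with the default na.action = na.omit:
# row 1 is removed before median() runs, so x = 1 is lost too
agg <- aggregate(. ~ g, data = toy, FUN = median, na.rm = TRUE)

# Column-by-column medians for comparison: only the NA itself is skipped
x_by_group <- tapply(toy$x, toy$g, median, na.rm = TRUE)

print(agg)         # x median for group "a" is based on 2 and 9 only
print(x_by_group)  # x median for group "a" is based on 1, 2 and 9
```

In this toy case the two approaches disagree for x in group "a" even though x itself has no missing values, which is the same kind of discrepancy described in the question.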

You can correct this by specifying a value for the na.action argument, as described in this accepted answer. Here it would be:

test2<-aggregate(.~d3, data=out2, FUN=median, na.rm = TRUE, na.action=NULL)
test2<-test2 %>% 
  gather(variable, value, -d3)

Confirm that the results are the same:

identical(as.data.frame(test %>% arrange(d3, variable, value)),
          as.data.frame(test2 %>% arrange(d3, variable, value)))
[1] TRUE

Upvotes: 2
