Group medians from a data frame using dplyr

Question

Computing medians seems to be a bit of an Achilles' heel for R (i.e. there is no data.frame method). What is the least amount of typing needed to get group medians from a data frame using dplyr?

my_data <- structure(list(group = c("Group 1", "Group 1", "Group 1", "Group 1", 
"Group 1", "Group 1", "Group 1", "Group 1", "Group 1", "Group 1", 
"Group 1", "Group 1", "Group 1", "Group 1", "Group 1", "Group 2", 
"Group 2", "Group 2", "Group 2", "Group 2", "Group 2", "Group 2", 
"Group 2", "Group 2", "Group 2", "Group 2", "Group 2", "Group 2", 
"Group 2", "Group 2"), value = c("5", "3", "6", "8", "10", "13", 
"1", "4", "18", "4", "7", "9", "14", "15", "17", "7", "3", "9", 
"10", "33", "15", "18", "6", "20", "30", NA, NA, NA, NA, NA)), .Names = c("group", 
"value"), class = c("tbl_df", "data.frame"), row.names = c(NA, 
-30L))

library(dplyr)  

# groups 1 & 2
my_data_groups_1_and_2 <- my_data[my_data$group %in% c("Group 1", "Group 2"), ]

# compute medians per group
medians <- my_data_groups_1_and_2 %>%
  group_by(group) %>%
  summarize(the_medians = median(value, na.rm = TRUE))

Which gives:

Error in summarise_impl(.data, dots) : 
  STRING_ELT() can only be applied to a 'character vector', not a 'double'
In addition: Warning message:
In mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]) :
  argument is not numeric or logical: returning NA

What is the least-effort workaround here?

talat · Accepted Answer

As ivyleavedtoadflax commented, the error arises because median() is given an argument that is neither numeric nor logical: your value column is of type character (you can tell because the numbers are quoted). Here are two simple ways to fix it:

my_data %>% 
  filter(group %in% c("Group 1", "Group 2")) %>%
  group_by(group) %>%
  summarize(the_medians = median(as.numeric(value), na.rm = TRUE))

Or

my_data %>% 
  filter(group %in% c("Group 1", "Group 2")) %>%
  mutate(value = as.numeric(value))  %>%
  group_by(group) %>%
  summarize(the_medians = median(value, na.rm = TRUE))
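
Before relying on either version, it may be worth checking that as.numeric() did not turn any unparseable strings into NA. The following is only a minimal sketch: `raw_value` is a hypothetical vector standing in for a column like value, not the question's actual data.

```r
# Hypothetical character vector standing in for a column like `value`
raw_value <- c("5", "3", NA, "10")

# as.numeric() parses numeric-looking strings; existing NAs stay NA
converted <- as.numeric(raw_value)

# Count NAs introduced by the conversion itself
# (0 means every non-missing string was parsed as a number)
new_nas <- sum(is.na(converted)) - sum(is.na(raw_value))
new_nas
```

If `new_nas` were greater than zero, some entries would not be numeric text, and na.rm = TRUE would silently drop them from the median.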

To check the structure of your data, including the column types, you can use str():

str(my_data)
#Classes ‘tbl_df’ and 'data.frame': 30 obs. of  2 variables:
# $ group: chr  "Group 1" "Group 1" "Group 1" "Group 1" ...
# $ value: chr  "5" "3" "6" "8" ...
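
If only the column types are of interest, a more compact check is possible with base R. This is a sketch on a hypothetical toy data frame (`toy` is not from the question) that has the same column types as my_data:

```r
# Hypothetical toy data frame with the same column types as in the question
toy <- data.frame(
  group = c("Group 1", "Group 2"),
  value = c("5", "7"),
  stringsAsFactors = FALSE
)

# Class of every column, returned as a named character vector
col_classes <- sapply(toy, class)
col_classes

# FALSE here, because `value` is character and must be converted before median()
is.numeric(toy$value)
```

The same two calls can be applied to my_data directly; is.numeric() should return TRUE once the column has been converted with as.numeric().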
