create dataframe of difference of medians in column based on values of another column

Question

I have a dataframe which looks like this:

data <- data.frame(id=c(1,2,6,3,7,1,5,7),
 class=c('apple','boy','boy','apple','boy','apple','apple','boy'), 
type=c('type1','type1','type2','type2','type3','type4','type4','type4'), 
col1=c(-0.9,0.8,0.7,-0.6,-0.5,0.4,0.3,0.9), col2=c(-6.9,2.8,0.4,-1.6,-0.8,0.6,0.2,-0.1), 
col3=c(6.7,0.9,0.2,-0.7,-0.8,1.6,3.2,0.1))

id class  type col1 col2 col3
1 apple type1 -0.9 -6.9  6.7
2   boy type1  0.8  2.8  0.9
6   boy type2  0.7  0.4  0.2
3 apple type2 -0.6 -1.6 -0.7
7   boy type3 -0.5 -0.8 -0.8
1 apple type4  0.4  0.6  1.6
5 apple type4  0.3  0.2  3.2
7   boy type4  0.9 -0.1  0.1

I am trying to create a dataframe which has the same columns (i.e., col1, col2, col3, ...) but the values in it should be median((data %>% filter(class=="apple"))$col1) - median((data %>% filter(class=="boy"))$col1) and so on for each type for each column.

So, the final dataframe will look like

  type col1 col2 col3
type1 -0.1 -4.1  3.7
type2  0.7  0.4  0.2
type3 -0.5 -0.8 -0.8
type4  0.4  0.6  1.6

I can do this by creating individual dataframes of each type and calculating the difference of medians of both the classes and append the vector to an empty dataframe with bind_rows().

But is there any better and easier method to do this?

r2evans · Accepted Answer

The method you want is something like this:

data %>%
  group_by(type) %>%
  summarize(across(col1:col3, ~ median(.[class=="boy"] - median(.[class=="boy"]))))
# # A tibble: 4 x 4
#   type   col1  col2  col3
#      
# 1 type1     0     0     0
# 2 type2     0     0     0
# 3 type3     0     0     0
# 4 type4     0     0     0

though in this instance it will return all 0s because you only have one "boy" within each group.

Post question-edit, here's the updated code and results:

data %>%
 group_by(type) %>%
 summarize(across(col1:col3, ~ median(.[class=="apple"]) - median(.[class=="boy"])))
# # A tibble: 4 x 4
#   type     col1   col2    col3
#           
# 1 type1 -1.7    -9.700  5.8   
# 2 type2 -1.3000 -2     -0.9000
# 3 type3 NA      NA     NA     
# 4 type4 -0.55    0.5    2.3

The NAs are because type3 only has "boy", no "apple".

(At least we aren't comparing "apple" to "orange", that would have been rather cliché ;-)

create dataframe of difference of medians in column based on values of another column

Answers (2)

Related Questions