MeC

Reputation: 463

How to group_by() and subset within mutate?

I'm looking for a way to do a computation across two grouping variables:

 library(dplyr)

 Age <- sample(c("4", "5", "6", "adult"), 20, replace = TRUE)
 letter <- sample(c("a", "c", "d"), 20, replace = TRUE)
 measurement <- sample(1.5:50.5, 20, replace = TRUE)

 # tibble() replaces the deprecated data_frame()
 df <- tibble(Age, letter, measurement)

I want to group by Age and letter:

 df2 <- df %>%
     group_by(Age, letter) 

but then I want to calculate the difference between the median measurement from one subset of Age and another:

 df2 <- df %>%
     group_by(Age, letter) %>%
     mutate(diff = median(measurement[Age=='adult']) - median(measurement[Age!='adult']))

For each combination of age group and letter, I want the difference between the median 'measurement' of that group and the median 'measurement' of the adults. My current approach only produces NAs, so it is clearly not correct. Is there a better way?

Upvotes: 2

Views: 1306

Answers (2)

Georgery

Reputation: 8117

You can calculate the median for the adults first:

adultMedian <- df %>%
    filter(Age == "adult") %>%
    summarise(adultMedian = median(measurement)) %>%
    pull()

df %>%
    group_by(Age, letter) %>%
    mutate(diff = median(measurement) - adultMedian)

This results in:

   Age   letter measurement  diff
   <chr> <chr>        <dbl> <dbl>
 1 5     a              9.5 -15  
 2 adult c             24.5  -5  
 3 5     c             12.5 -12  
 4 6     d             18.5   6  
 5 adult a             27.5   3  
 6 adult d             37.5   3.5
 7 4     c             11.5   0.5
 8 6     d             31.5   6  
 9 5     c             32.5 -12  
10 6     c             18.5  -6  
11 5     d             49.5  25  
12 4     d             50.5  26  
13 4     c             38.5   0.5
14 6     d             30.5   6  
15 adult c              4.5  -5  
16 adult c             14.5  -5  
17 5     c              7.5 -12  
18 4     a             24.5   0  
19 adult c             49.5  -5  
20 adult d             18.5   3.5

Upvotes: 0

Dan Chaltiel

Reputation: 8484

If I understood your question correctly, you want to compute a difference between a fixed value (the median among adults) and a value that varies across groups.

Since the dataframe is grouped, you need to refer to the original dataframe in the calculation. Also, as you want only one value for each group, you don't want to mutate but to summarise:

df %>%
  group_by(Age, letter) %>%
  summarise(diff = median(measurement) - median(df$measurement[df$Age=='adult']))
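
To illustrate why the grouped subset gives NAs, here is a minimal sketch on a small hand-made dataset (the `toy` tibble and its values are made up for illustration, not taken from the question). Within a non-adult group, `measurement[Age == "adult"]` is empty, and the median of an empty vector is NA:

```r
library(dplyr)

# Hypothetical toy data, chosen so the medians are easy to check by hand
toy <- tibble(
  Age = c("4", "4", "adult", "adult"),
  measurement = c(10, 20, 30, 50)
)

# Inside a group, Age == "adult" is evaluated on that group's rows only,
# so for the non-adult group the subset is empty and median() returns NA
grouped_only <- toy %>%
  group_by(Age) %>%
  mutate(adult_median = median(measurement[Age == "adult"])) %>%
  ungroup()

# Referring to the ungrouped data gives the same adult median in every group
with_original <- toy %>%
  group_by(Age) %>%
  summarise(diff = median(measurement) - median(toy$measurement[toy$Age == "adult"]))
```

In `grouped_only`, the rows with Age "4" should hold NA, while `with_original` has one non-missing difference per age group.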

Alternatively, if you want to stick to the dplyr pipeline, you could first mutate a dummy variable which holds the median, and then use the first occurrence of this variable in the summarise call.

df %>%
  #group_by(letter) %>% #might also be interesting
  mutate(dummy=median(measurement[Age=='adult'])) %>% 
  group_by(Age, letter) %>%
  summarise(diff = median(measurement) - dummy[1])

This might be less efficient, but it lets you group before calculating the fixed median, which might be useful too.
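
As a rough sketch of that grouped variant, assuming the goal is to compare each Age/letter group with the adult median for the same letter, one could group by letter before computing the reference value. The `toy` data and the `adult_median` column name are hypothetical, and `.groups = "drop"` assumes dplyr 1.0.0 or later:

```r
library(dplyr)

# Hypothetical toy data with two letters, each containing adult rows
toy <- tibble(
  Age = c("4", "4", "adult", "adult", "5", "adult"),
  letter = c("a", "a", "a", "a", "c", "c"),
  measurement = c(10, 20, 30, 50, 5, 15)
)

result <- toy %>%
  # First pass: the adult median is computed separately within each letter
  group_by(letter) %>%
  mutate(adult_median = median(measurement[Age == "adult"])) %>%
  # Second pass: regroup (this replaces the previous grouping) and summarise
  group_by(Age, letter) %>%
  summarise(diff = median(measurement) - first(adult_median), .groups = "drop")
```

Note that a letter without any adult rows would get NA as its reference median, so the differences for that letter would be NA as well.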

Upvotes: 1
