Reputation: 463
I'm looking for a way to make a computation across two grouping variables:
library(dplyr)
Age <- sample(c("4", "5", "6", "adult"), 20, replace = TRUE)
letter <- sample(c("a", "c", "d"), 20, replace = TRUE)
measurement <- sample(1.5:50.5, 20, replace = TRUE)
# tibble() replaces the deprecated data_frame()
df <- tibble(Age, letter, measurement)
I want to group_by Age and letter:
df2 <- df %>%
  group_by(Age, letter)
but then I want to calculate the difference between the median measurement from one subset of Age and another:
df2 <- df %>%
  group_by(Age, letter) %>%
  mutate(diff = median(measurement[Age=='adult']) - median(measurement[Age!='adult']))
I want the difference between 'measurement' (from adults) and 'measurement' (from each age group) for each age group and letter combination. I currently generate NAs; my approach is not correct. There must be a better way.
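The NAs most likely come from the grouping itself: inside each group, Age takes a single value, so one of the two subsets is always empty, and median() of an empty vector is NA. A minimal sketch of this, using a small hand-made data frame (the name toy and its values are illustrative assumptions, not the data above):

```r
library(dplyr)

# Tiny illustrative data set (hypothetical values)
toy <- tibble(
  Age = c("4", "4", "adult", "adult"),
  letter = c("a", "a", "a", "a"),
  measurement = c(10, 20, 30, 50)
)

res <- toy %>%
  group_by(Age, letter) %>%
  mutate(diff = median(measurement[Age == "adult"]) -
                median(measurement[Age != "adult"])) %>%
  ungroup()

# In the "4" group there are no adult rows, and in the "adult" group
# there are no non-adult rows, so one of the two medians is always NA.
print(res)
print(median(numeric(0)))
```

Every row of diff ends up NA here, which suggests the adult median has to be computed outside the Age grouping.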
Upvotes: 2
Views: 1306
Reputation: 8117
You can calculate the median for the adults first:
adultMedian <- df %>%
  filter(Age == "adult") %>%
  summarise(adultMedian = median(measurement)) %>%
  pull()
df %>%
  group_by(Age, letter) %>%
  mutate(diff = median(measurement) - adultMedian)
This results in:
Age letter measurement diff
<chr> <chr> <dbl> <dbl>
1 5 a 9.5 -15
2 adult c 24.5 -5
3 5 c 12.5 -12
4 6 d 18.5 6
5 adult a 27.5 3
6 adult d 37.5 3.5
7 4 c 11.5 0.5
8 6 d 31.5 6
9 5 c 32.5 -12
10 6 c 18.5 -6
11 5 d 49.5 25
12 4 d 50.5 26
13 4 c 38.5 0.5
14 6 d 30.5 6
15 adult c 4.5 -5
16 adult c 14.5 -5
17 5 c 7.5 -12
18 4 a 24.5 0
19 adult c 49.5 -5
20 adult d 18.5 3.5
Upvotes: 0
Reputation: 8484
If I understood your question correctly, you want to compute a difference between a fixed value (the median among adults) and a value that varies across groups.
Since the data frame is grouped, you need to refer to the original, ungrouped data frame in the calculation. Also, since you want only one value per group, you should use summarise rather than mutate:
df %>%
  group_by(Age, letter) %>%
  summarise(diff = median(measurement) - median(df$measurement[df$Age=='adult']))
Alternatively, if you want to stay within the dplyr pipeline, you could first mutate a dummy variable that holds the adult median, and then use the first value of this variable in the summarise call:
df %>%
  #group_by(letter) %>% #might also be interesting
  mutate(dummy=median(measurement[Age=='adult'])) %>%
  group_by(Age, letter) %>%
  summarise(diff = median(measurement) - dummy[1])
This might be less optimized, but it lets you group before calculating the fixed median, which might be useful too.
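As a rough sketch of that variant, grouping by letter first gives each letter its own adult median, which is then compared against every Age and letter combination. The data frame toy, the column name adult_median, and the sign convention (adult minus group, as the question words it) are assumptions made for illustration:

```r
library(dplyr)

# Small hand-made data set (hypothetical values)
toy <- tibble(
  Age = c("4", "4", "adult", "adult", "5", "adult"),
  letter = c("a", "a", "a", "a", "c", "c"),
  measurement = c(10, 20, 30, 50, 5, 25)
)

res <- toy %>%
  group_by(letter) %>%
  # adult median computed separately within each letter
  mutate(adult_median = median(measurement[Age == "adult"])) %>%
  # regrouping replaces the previous grouping by letter
  group_by(Age, letter) %>%
  summarise(diff = adult_median[1] - median(measurement), .groups = "drop")

print(res)
```

With this convention the adult rows always get a diff of 0, and a letter with no adult rows would get NA, so it may be worth filtering such letters out beforehand.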
Upvotes: 1