Reputation: 1644
How to use summarise to take the value of a variable that corresponds to the maximum value of another variable?
Data: I have a simplified dataset below.
df <- read.table(text = "
ID SBP DATE
1 90 20210102
1 106 20210111
2 80 20210513
2 87 20210513
2 88 20210413", header = TRUE)
I wish to take the value of SBP
, which corresponds to the latest DATE
(i.e. most recent measurement of systolic blood pressure). There may be ties, i.e. > 1 measurement within the same day (as shown in ID=2
) and in such case, I would like to take the first line. Besides this, I may need to obtain other variables, like mean of SBP, no. of measurements of SBP etc. Hence, I would like to use only summarise()
. Below is the desired output.
Desired output:
df <- read.table(text = "
ID SBP
1 106
2 80", header = TRUE)
Here's what I have done previously.
1) Using summarise
with [
and which.max
df %>% group_by(ID) %>% summarise(SBP = SBP[which.max(DATE)])
## A tibble: 2 x 2
# ID SBP
# <int> <int>
#1 1 106
#2 2 80
2) Using slice_max
df %>% group_by(ID) %>% slice_max(DATE, with_ties = FALSE)
## A tibble: 2 x 2
# ID SBP
# <int> <int>
#1 1 106
#2 2 80
3) Using summarise
with last
df %>% group_by(ID) %>% summarise(SBP = last(SBP, DATE))
## A tibble: 2 x 2
# ID SBP
# <int> <int>
#1 1 106
#2 2 87
I think (3) is ideal in terms of readability, but instead of taking the first line item, it takes the last line item (not what I want). If I use (2), I would have to use mutate
to create other variables of interest (like number of measurements, mean etc.) before using slice_max
. (1) would be confusing to other R readers/users.
My question: How can I write something like (3), but taking the first line, when there are ties?
Upvotes: 1
Views: 691
Reputation: 2174
I would use either 1) arrange
+ distinct
or 2) group_by
+ summarise
+ first
. The first approach is not as readable, but with big datasets it's actually more performant than using group by.
library(tidyverse)
df %>%
arrange(ID, -DATE) %>%
distinct(ID, .keep_all = TRUE)
#> ID SBP DATE
#> 1 1 106 20210111
#> 2 2 80 20210513
df %>%
group_by(ID) %>%
summarise(
SBP = first(SBP, -DATE)
)
#> # A tibble: 2 x 2
#> ID SBP
#> * <int> <int>
#> 1 1 106
#> 2 2 80
Created on 2021-05-18 by the reprex package (v1.0.0)
Upvotes: 1