HNSKD
HNSKD

Reputation: 1644

How to use summarise to take the value of an variable that corresponds to the max. value of another variable?

How to use summarise to take the value of a variable that corresponds to the maximum value of another variable?

Data: I have a simplified dataset below.

df <- read.table(text = "
                 ID SBP DATE
                 1 90 20210102
                 1 106 20210111
                 2 80 20210513
                 2 87 20210513
                 2 88 20210413", header = TRUE)

I wish to take the value of SBP, which corresponds to the latest DATE (i.e. most recent measurement of systolic blood pressure). There may be ties, i.e. > 1 measurement within the same day (as shown in ID=2) and in such case, I would like to take the first line. Besides this, I may need to obtain other variables, like mean of SBP, no. of measurements of SBP etc. Hence, I would like to use only summarise(). Below is the desired output.

Desired output:

df <- read.table(text = "
                 ID SBP 
                 1 106 
                 2 80", header = TRUE)

Here's what I have done previously.

1) Using summarise with [ and which.max

df %>% group_by(ID) %>% summarise(SBP = SBP[which.max(DATE)])
## A tibble: 2 x 2
#     ID   SBP
#  <int> <int>
#1     1   106
#2     2    80

2) Using slice_max

df %>% group_by(ID) %>% slice_max(DATE, with_ties = FALSE)
## A tibble: 2 x 2
#     ID   SBP
#  <int> <int>
#1     1   106
#2     2    80

3) Using summarise with last

df %>% group_by(ID) %>% summarise(SBP = last(SBP, DATE))
## A tibble: 2 x 2
#     ID   SBP
#  <int> <int>
#1     1   106
#2     2    87

I think (3) is ideal in terms of readability, but instead of taking the first line item, it takes the last line item (not what I want). If I use (2), I would have to use mutate to create other variables of interest (like number of measurements, mean etc.) before using slice_max. (1) would be confusing to other R readers/users.

My question: How can I write something like (3), but taking the first line, when there are ties?

Upvotes: 1

Views: 691

Answers (1)

Peter H.
Peter H.

Reputation: 2174

I would use either 1) arrange + distinct or 2) group_by + summarise + first . The first approach is not as readable, but with big datasets it's actually more performant than using group by.

library(tidyverse)

df %>%
  arrange(ID, -DATE) %>% 
  distinct(ID, .keep_all = TRUE)
#>   ID SBP     DATE
#> 1  1 106 20210111
#> 2  2  80 20210513

df %>% 
  group_by(ID) %>% 
  summarise(
    SBP = first(SBP, -DATE)
  )
#> # A tibble: 2 x 2
#>      ID   SBP
#> * <int> <int>
#> 1     1   106
#> 2     2    80

Created on 2021-05-18 by the reprex package (v1.0.0)

Upvotes: 1

Related Questions