JensRtee
JensRtee

Reputation: 41

Finding cumulative second max per group in R

I have a dataset where I would like to create a new variable that is the cumulative second largest value of another variable, and I would like to perform this function per group.

Let's say I create the following example data frame:

(df1 <- data.frame(patient = rep(1:5, each=8), visit = rep(1:2,each=4,5), trial = rep(1:4,10), var1 = sample(1:50,20,replace=TRUE)))

This is pretend data that represents 5 patients who each had 2 study visits, and each visit had 4 trials with a measurement taken (var1).

> head(df1,n=20)
   patient visit trial var1
1        1     1     1   25
2        1     1     2   23
3        1     1     3   48
4        1     1     4   37
5        1     2     1   41
6        1     2     2   45
7        1     2     3    8
8        1     2     4    9
9        2     1     1   26
10       2     1     2   14
11       2     1     3   41
12       2     1     4   35
13       2     2     1   37
14       2     2     2   30
15       2     2     3   14
16       2     2     4   28
17       3     1     1   34
18       3     1     2   19
19       3     1     3   28
20       3     1     4   10

I would like to create a new variable, cum2ndmax, that is the cumulative 2nd largest value of var1 and I would like to group this variable by patient # and visit #.

I figured out how to calculate the cumulative 2nd max number like so:

df1$cum2ndmax <- sapply(seq_along(df1$var1),function(x){sort(df1$var1[seq(x)],decreasing=TRUE)[2]})
df1

However, this calculates the cumulative 2nd max across the whole dataset, not for each group. I have attempted to calculate this variable using grouped data like so after installing and loading package dplyr:

library(dplyr)    
df2 <- df1 %>%
        group_by(patient,visit) %>%
        mutate(cum2ndmax = sapply(seq_along(df1$var1),function(x){sort(df1$var1[seq(x)],decreasing=TRUE)[2]}))

But I get an error: Error: Problem with mutate() input cum2ndmax. x Input cum2ndmax can't be recycled to size 4.

Ideally, my result would look something like this:

patient visit   trial   var1    cum2ndmax
1   1   1   25  NA
1   1   2   23  23
1   1   3   48  25
1   1   4   37  37
1   2   1   41  NA
1   2   2   45  41
1   2   3   8   41
1   2   4   9   41
2   1   1   26  NA
2   1   2   14  14
2   1   3   41  26
2   1   4   35  35
…   …   …   …   …

Any help in getting this to work in R would be much appreciated! Thank you!

Upvotes: 4

Views: 229

Answers (3)

JensRtee
JensRtee

Reputation: 41

Thanks so much everyone! I really appreciate it and could not have solved this without your help! In the end, I ended up using a similar approach suggested by tmfmnk since I was already using dplyr. I found an interesting result with the code suggested by tmkmnk where for some reason it gave me a column of values that just repeated the first row's number. With a small tweak to change dense_rank to order, I got exactly what I wanted like this:

df1 %>%
 group_by(patient, visit) %>%
 mutate(cum_second_max = map_dbl(.x = seq_along(var1),
                                 ~ ifelse(.x == 1, NA, var1[order(-var1[1:.x])[2])))

Upvotes: 0

Paul
Paul

Reputation: 9087

Here is an Rcpp solution.

cum_second_max is a modification of cummax which keeps track of the second maximum.

library(tidyverse)

Rcpp::cppFunction("
NumericVector cum_second_max(NumericVector x) {
  double max_value = R_NegInf, max_value2 = NA_REAL;
  NumericVector result(x.length());

  for (int i = 0 ; i < x.length() ; ++i) {
    if (x[i] > max_value) {
      max_value2 = max_value;
      max_value = x[i];
    }
    else if (x[i] < max_value && x[i] > max_value2) {
      max_value2 = x[i];
    }
    result[i] = isinf(max_value2) ? NA_REAL : max_value2;
  }
  return result;
}
")

df1 %>%
  group_by(patient, visit) %>%
  mutate(
    c2max = cum_second_max(var1)
  )
#> # A tibble: 20 x 5
#> # Groups:   patient, visit [5]
#>    patient visit trial  var1 c2max
#>      <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1       1     1     1    25    NA
#>  2       1     1     2    23    23
#>  3       1     1     3    48    25
#>  4       1     1     4    37    37
#>  5       1     2     1    41    NA
#>  6       1     2     2    45    41
#>  7       1     2     3     8    41
#>  8       1     2     4     9    41
#>  9       2     1     1    26    NA
#> 10       2     1     2    14    14
#> 11       2     1     3    41    26
#> 12       2     1     4    35    35
#> 13       2     2     1    37    NA
#> 14       2     2     2    30    30
#> 15       2     2     3    14    30
#> 16       2     2     4    28    30
#> 17       3     1     1    34    NA
#> 18       3     1     2    19    19
#> 19       3     1     3    28    28
#> 20       3     1     4    10    28

Upvotes: 2

tmfmnk
tmfmnk

Reputation: 39858

One dplyr and purrr option could be:

df1 %>%
 group_by(patient, visit) %>%
 mutate(cum_second_max = map_dbl(.x = seq_along(var1),
                                 ~ ifelse(.x == 1, NA, var1[dense_rank(-var1[1:.x]) == 2])))

   patient visit trial  var1 cum_second_max
     <int> <int> <int> <int>          <dbl>
 1       1     1     1    25             NA
 2       1     1     2    23             23
 3       1     1     3    48             25
 4       1     1     4    37             37
 5       1     2     1    41             NA
 6       1     2     2    45             41
 7       1     2     3     8             41
 8       1     2     4     9             41
 9       2     1     1    26             NA
10       2     1     2    14             14
11       2     1     3    41             26
12       2     1     4    35             35
13       2     2     1    37             NA
14       2     2     2    30             30
15       2     2     3    14             30
16       2     2     4    28             30
17       3     1     1    34             NA
18       3     1     2    19             19
19       3     1     3    28             28
20       3     1     4    10             28

Upvotes: 4

Related Questions