Frank Harrell
Frank Harrell

Reputation: 2230

Carry forward a specific value if observations stop before a certain time

For a longitudinal dataset, I want to carry forward observations that terminated before day 7 with y=3, completing records with consecutive days up until day 7 with y=3. A related question is at How to make continuous time sequences within groups in data.table? . The following solution works but I would like to also have a solution that (1) subsetted the observations earlier (see below) or that (2) did the carry forward with a join in one step.

d <- data.table(t =c(1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 5, 6, 7, 1, 2, 3, 5, 6),
                id=c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5),
                y =c(1, 2, 1, 2, 3, 1, 1, 1, 2, 2, 3, 3, 3, 1, 2, 2, 2, 3),
                x =c(0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0),
                key=c('id', 't'))
d

    t id y x
 1: 1  1 1 0
 2: 2  1 2 0
 3: 1  2 1 1
 4: 2  2 2 1
 5: 3  2 3 1
 6: 1  3 1 0
 7: 2  3 1 0
 8: 1  4 1 1
 9: 2  4 2 1
10: 3  4 2 1
11: 5  4 3 1
12: 6  4 3 1
13: 7  4 3 1
14: 1  5 1 0
15: 2  5 2 0
16: 3  5 2 0
17: 5  5 2 0
18: 6  5 3 0

w <- d[, .(tlast=t, last3 = t == max(t) & y == 3 & t < 7, x=x), by=id]
w <- w[last3 == TRUE, .(t = (tlast + 1) : 7, y=rep(3, 7 - tlast), x=x), by=id]
d <- rbind(d, w)
setkey(d, id, t)
d

   t id y x
 1: 1  1 1 0
 2: 2  1 2 0
 3: 1  2 1 1
 4: 2  2 2 1
 5: 3  2 3 1
 6: 4  2 3 1
 7: 5  2 3 1
 8: 6  2 3 1
 9: 7  2 3 1
10: 1  3 1 0
11: 2  3 1 0
12: 1  4 1 1
13: 2  4 2 1
14: 3  4 2 1
15: 5  4 3 1
16: 6  4 3 1
17: 7  4 3 1
18: 1  5 1 0
19: 2  5 2 0
20: 3  5 2 0
21: 5  5 2 0
22: 6  5 3 0
23: 7  5 3 0
    t id y x

The following doesn't work (results in data.table with 0 rows and 4 cols)

w <- d[(t == max(t) & y == 3 & t < 7) == TRUE, .SD, by=id]

Upvotes: 2

Views: 130

Answers (3)

r2evans
r2evans

Reputation: 160607

data.table

cols <- c("x", "y")
merge(d[, .(t = if (3 %in% y && max(t) < 7) as.numeric(c(t, (1+max(t)):7)) else t),
      by = .(id)], d, by = c("id", "t"), all.x = TRUE
  )[, (cols) := lapply(.SD, nafill, type = "locf"), by = .(id), .SDcols = cols][]
#        id     t     y     x
#     <num> <num> <num> <num>
#  1:     1     1     1     0
#  2:     1     2     2     0
#  3:     2     1     1     1
#  4:     2     2     2     1
#  5:     2     3     3     1
#  6:     2     4     3     1
#  7:     2     5     3     1
#  8:     2     6     3     1
#  9:     2     7     3     1
# 10:     3     1     1     0
# 11:     3     2     1     0
# 12:     4     1     1     1
# 13:     4     2     2     1
# 14:     4     3     2     1
# 15:     4     5     3     1
# 16:     4     6     3     1
# 17:     4     7     3     1
# 18:     5     1     1     0
# 19:     5     2     2     0
# 20:     5     3     2     0
# 21:     5     5     2     0
# 22:     5     6     3     0
# 23:     5     7     3     0
#        id     t     y     x

Walk-through:

  • we first need to generate a list that contains the t values we need per-id, so

    d[, .(t = if (3 %in% y && max(t) < 7) as.numeric(c(t, (1+max(t)):7)) else t), by = .(id)]
    #        id     t
    #     <num> <num>
    #  1:     1     1
    #  2:     1     2
    #  3:     2     1
    #  4:     2     2
    #  5:     2     3
    #  6:     2     4
    #  7:     2     5
    #  8:     2     6
    #  9:     2     7
    # 10:     3     1
    # 11:     3     2
    # 12:     4     1
    # 13:     4     2
    # 14:     4     3
    # 15:     4     5
    # 16:     4     6
    # 17:     4     7
    # 18:     5     1
    # 19:     5     2
    # 20:     5     3
    # 21:     5     5
    # 22:     5     6
    # 23:     5     7
    #        id     t
    

    This does not fill in missing steps (4 is missing in ids 3 and 4). If y contains 3, then we fill out t up to 7, otherwise we do nothing.

    Note: t here is numeric, which requires a little dancing with (integer) sequences, ergo the as.numeric to silence data.table's complaints about matching column types.

  • a simple merge against the original d will leave some NA holes in the data, which is intentional:

    merge(d[, .(t = if (3 %in% y && max(t) < 7) as.numeric(c(t, (1+max(t)):7)) else t), by = .(id)], d, by = c("id", "t"), all.x = TRUE)
    #        id     t     y     x
    #     <num> <num> <num> <num>
    #  1:     1     1     1     0
    #  2:     1     2     2     0
    #  3:     2     1     1     1
    #  4:     2     2     2     1
    #  5:     2     3     3     1
    #  6:     2     4    NA    NA
    #  7:     2     5    NA    NA
    #  8:     2     6    NA    NA
    #  9:     2     7    NA    NA
    # 10:     3     1     1     0
    # 11:     3     2     1     0
    # 12:     4     1     1     1
    # 13:     4     2     2     1
    # 14:     4     3     2     1
    # 15:     4     5     3     1
    # 16:     4     6     3     1
    # 17:     4     7     3     1
    # 18:     5     1     1     0
    # 19:     5     2     2     0
    # 20:     5     3     2     0
    # 21:     5     5     2     0
    # 22:     5     6     3     0
    # 23:     5     7    NA    NA
    #        id     t     y     x
    
  • from here, it's as simple as nafill(., type="locf"), using .SDcols for efficiency (and generality, so that we don't care what other columns there are, as long as cols lists them).


The reason that

d[(t == max(t) & y == 3 & t < 7) == TRUE, .SD, by=id]

returns 0 rows is that the i condition is first and not in .SD. Because of this, it is a global condition, not a per-group condition. The by= is not considered for the condition, so expression is equivalent to

d[(t == max(t) & y == 3 & t < 7),] # no .SD, no by=

which is also 0 rows. But looking at it that way, realize that there is only one row where t == max(t), row 13, where t is 7. On that row, y is 3 (so far so good), but t<7 is not true.

Changing it to the per-group thing within .SD returns data:

d[, .SD[(t == max(t) & y == 3 & t < 7),], by=id]
#       id     t     y     x
#    <num> <num> <num> <num>
# 1:     2     3     3     1
# 2:     5     6     3     0

Upvotes: 2

Damian
Damian

Reputation: 1433

If you don't mind using dplyr this might do what you're looking for. Split the data set into two groups then address the cases that need the 3rd row repeated. dplyr::summarize is normally used to collapse many rows into one, but it can also expand one into many.

library(dplyr)

# original data
d <- tibble(t =c(1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 5, 6, 7, 1, 2, 3, 5, 6),
            id=c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5),
            y =c(1, 2, 1, 2, 3, 1, 1, 1, 2, 2, 3, 3, 3, 1, 2, 2, 2, 3),
            x =c(0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0)) %>%
        arrange(id, t)


bind_rows(
    
    
    d %>%
        # subset for cases to leave alone
        group_by(id) %>%
        filter( max(t) < 3 | 7 <= max(t))  %>%
        ungroup(),
    
    d %>%
        # subset for cases to fill in by carrying values forward
        group_by(id) %>%
        filter(3 <= max(t) & max(t) < 7) %>%

        # get the 3rd row
        filter(t == 3) %>%

        # repeat 3rd row until there are 7 rows
        summarize(t = seq(t, 7), x = x, y = y)  %>%
        ungroup()
    
) %>% arrange(id, t)

Output

# A tibble: 20 x 4
       t    id     y     x
   <dbl> <dbl> <dbl> <dbl>
 1     1     1     1     0
 2     2     1     2     0
 3     3     2     3     1
 4     4     2     3     1
 5     5     2     3     1
 6     6     2     3     1
 7     7     2     3     1
 8     1     3     1     0
 9     2     3     1     0
10     1     4     1     1
11     2     4     2     1
12     3     4     2     1
13     5     4     3     1
14     6     4     3     1
15     7     4     3     1
16     3     5     2     0
17     4     5     2     0
18     5     5     2     0
19     6     5     2     0
20     7     5     2     0

Upvotes: 0

eipi10
eipi10

Reputation: 93851

I'm not sure this is more efficient, but here's a dplyr approach (my data.table knowledge is too limited for this one). Hopefully, you can adapt a data.table version of this if you find it useful.

library(dplyr)

crossing(id=unique(d$id), t=1:max(d$t)) %>% 
  full_join(d) %>% 
  group_by(id) %>% 
  filter(!(max(y, na.rm=TRUE) < 3 & is.na(y)) &
           !(is.na(y) & !is.na(lead(y)))) %>% 
  mutate(across(c(y,x), zoo::na.locf)) 
      id     t     y     x
 1     1     1     1     0
 2     1     2     2     0
 3     2     1     1     1
 4     2     2     2     1
 5     2     3     3     1
 6     2     4     3     1
 7     2     5     3     1
 8     2     6     3     1
 9     2     7     3     1
10     3     1     1     0
11     3     2     1     0
12     4     1     1     1
13     4     2     2     1
14     4     3     2     1
15     4     5     3     1
16     4     6     3     1
17     4     7     3     1
18     5     1     1     0
19     5     2     2     0
20     5     3     2     0
21     5     5     2     0
22     5     6     3     0
23     5     7     3     0

Upvotes: 0

Related Questions