Reputation: 2230
For a longitudinal dataset, I want to carry forward observations that terminated before day 7 with y=3, completing records with consecutive days up until day 7 with y=3. A related question is at How to make continuous time sequences within groups in data.table? . The following solution works but I would like to also have a solution that (1) subsetted the observations earlier (see below) or that (2) did the carry forward with a join in one step.
d <- data.table(t =c(1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 5, 6, 7, 1, 2, 3, 5, 6),
id=c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5),
y =c(1, 2, 1, 2, 3, 1, 1, 1, 2, 2, 3, 3, 3, 1, 2, 2, 2, 3),
x =c(0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0),
key=c('id', 't'))
d
t id y x
1: 1 1 1 0
2: 2 1 2 0
3: 1 2 1 1
4: 2 2 2 1
5: 3 2 3 1
6: 1 3 1 0
7: 2 3 1 0
8: 1 4 1 1
9: 2 4 2 1
10: 3 4 2 1
11: 5 4 3 1
12: 6 4 3 1
13: 7 4 3 1
14: 1 5 1 0
15: 2 5 2 0
16: 3 5 2 0
17: 5 5 2 0
18: 6 5 3 0
w <- d[, .(tlast=t, last3 = t == max(t) & y == 3 & t < 7, x=x), by=id]
w <- w[last3 == TRUE, .(t = (tlast + 1) : 7, y=rep(3, 7 - tlast), x=x), by=id]
d <- rbind(d, w)
setkey(d, id, t)
d
t id y x
1: 1 1 1 0
2: 2 1 2 0
3: 1 2 1 1
4: 2 2 2 1
5: 3 2 3 1
6: 4 2 3 1
7: 5 2 3 1
8: 6 2 3 1
9: 7 2 3 1
10: 1 3 1 0
11: 2 3 1 0
12: 1 4 1 1
13: 2 4 2 1
14: 3 4 2 1
15: 5 4 3 1
16: 6 4 3 1
17: 7 4 3 1
18: 1 5 1 0
19: 2 5 2 0
20: 3 5 2 0
21: 5 5 2 0
22: 6 5 3 0
23: 7 5 3 0
t id y x
The following doesn't work (results in data.table with 0 rows and 4 cols)
w <- d[(t == max(t) & y == 3 & t < 7) == TRUE, .SD, by=id]
Upvotes: 2
Views: 130
Reputation: 160607
cols <- c("x", "y")
merge(d[, .(t = if (3 %in% y && max(t) < 7) as.numeric(c(t, (1+max(t)):7)) else t),
by = .(id)], d, by = c("id", "t"), all.x = TRUE
)[, (cols) := lapply(.SD, nafill, type = "locf"), by = .(id), .SDcols = cols][]
# id t y x
# <num> <num> <num> <num>
# 1: 1 1 1 0
# 2: 1 2 2 0
# 3: 2 1 1 1
# 4: 2 2 2 1
# 5: 2 3 3 1
# 6: 2 4 3 1
# 7: 2 5 3 1
# 8: 2 6 3 1
# 9: 2 7 3 1
# 10: 3 1 1 0
# 11: 3 2 1 0
# 12: 4 1 1 1
# 13: 4 2 2 1
# 14: 4 3 2 1
# 15: 4 5 3 1
# 16: 4 6 3 1
# 17: 4 7 3 1
# 18: 5 1 1 0
# 19: 5 2 2 0
# 20: 5 3 2 0
# 21: 5 5 2 0
# 22: 5 6 3 0
# 23: 5 7 3 0
# id t y x
Walk-through:
we first need to generate a list that contains the t
values we need per-id
, so
d[, .(t = if (3 %in% y && max(t) < 7) as.numeric(c(t, (1+max(t)):7)) else t), by = .(id)]
# id t
# <num> <num>
# 1: 1 1
# 2: 1 2
# 3: 2 1
# 4: 2 2
# 5: 2 3
# 6: 2 4
# 7: 2 5
# 8: 2 6
# 9: 2 7
# 10: 3 1
# 11: 3 2
# 12: 4 1
# 13: 4 2
# 14: 4 3
# 15: 4 5
# 16: 4 6
# 17: 4 7
# 18: 5 1
# 19: 5 2
# 20: 5 3
# 21: 5 5
# 22: 5 6
# 23: 5 7
# id t
This does not fill in missing steps (4
is missing in id
s 3 and 4). If y
contains 3
, then we fill out t
up to 7, otherwise we do nothing.
Note: t
here is numeric
, which requires a little dancing with (integer
) sequences, ergo the as.numeric
to silence data.table's complaints about matching column types.
a simple merge against the original d
will leave some NA
holes in the data, which is intentional:
merge(d[, .(t = if (3 %in% y && max(t) < 7) as.numeric(c(t, (1+max(t)):7)) else t), by = .(id)], d, by = c("id", "t"), all.x = TRUE)
# id t y x
# <num> <num> <num> <num>
# 1: 1 1 1 0
# 2: 1 2 2 0
# 3: 2 1 1 1
# 4: 2 2 2 1
# 5: 2 3 3 1
# 6: 2 4 NA NA
# 7: 2 5 NA NA
# 8: 2 6 NA NA
# 9: 2 7 NA NA
# 10: 3 1 1 0
# 11: 3 2 1 0
# 12: 4 1 1 1
# 13: 4 2 2 1
# 14: 4 3 2 1
# 15: 4 5 3 1
# 16: 4 6 3 1
# 17: 4 7 3 1
# 18: 5 1 1 0
# 19: 5 2 2 0
# 20: 5 3 2 0
# 21: 5 5 2 0
# 22: 5 6 3 0
# 23: 5 7 NA NA
# id t y x
from here, it's as simple as nafill(., type="locf")
, using .SDcols
for efficiency (and generality, so that we don't care what other columns there are, as long as cols
lists them).
The reason that
d[(t == max(t) & y == 3 & t < 7) == TRUE, .SD, by=id]
returns 0 rows is that the i
condition is first and not in .SD
. Because of this, it is a global condition, not a per-group condition. The by=
is not considered for the condition, so expression is equivalent to
d[(t == max(t) & y == 3 & t < 7),] # no .SD, no by=
which is also 0 rows. But looking at it that way, realize that there is only one row where t == max(t)
, row 13, where t
is 7. On that row, y
is 3 (so far so good), but t<7
is not true.
Changing it to the per-group thing within .SD
returns data:
d[, .SD[(t == max(t) & y == 3 & t < 7),], by=id]
# id t y x
# <num> <num> <num> <num>
# 1: 2 3 3 1
# 2: 5 6 3 0
Upvotes: 2
Reputation: 1433
If you don't mind using dplyr
this might do what you're looking for. Split the data set into two groups then address the cases that need the 3rd row repeated. dplyr::summarize
is normally used to collapse many rows into one, but it can also expand one into many.
library(dplyr)
# original data
d <- tibble(t =c(1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 5, 6, 7, 1, 2, 3, 5, 6),
id=c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5),
y =c(1, 2, 1, 2, 3, 1, 1, 1, 2, 2, 3, 3, 3, 1, 2, 2, 2, 3),
x =c(0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0)) %>%
arrange(id, t)
bind_rows(
d %>%
# subset for cases to leave alone
group_by(id) %>%
filter( max(t) < 3 | 7 <= max(t)) %>%
ungroup(),
d %>%
# subset for cases to fill in by carrying values forward
group_by(id) %>%
filter(3 <= max(t) & max(t) < 7) %>%
# get the 3rd row
filter(t == 3) %>%
# repeat 3rd row until there are 7 rows
summarize(t = seq(t, 7), x = x, y = y) %>%
ungroup()
) %>% arrange(id, t)
Output
# A tibble: 20 x 4
t id y x
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1 2 0
3 3 2 3 1
4 4 2 3 1
5 5 2 3 1
6 6 2 3 1
7 7 2 3 1
8 1 3 1 0
9 2 3 1 0
10 1 4 1 1
11 2 4 2 1
12 3 4 2 1
13 5 4 3 1
14 6 4 3 1
15 7 4 3 1
16 3 5 2 0
17 4 5 2 0
18 5 5 2 0
19 6 5 2 0
20 7 5 2 0
Upvotes: 0
Reputation: 93851
I'm not sure this is more efficient, but here's a dplyr
approach (my data.table
knowledge is too limited for this one). Hopefully, you can adapt a data.table
version of this if you find it useful.
library(dplyr)
crossing(id=unique(d$id), t=1:max(d$t)) %>%
full_join(d) %>%
group_by(id) %>%
filter(!(max(y, na.rm=TRUE) < 3 & is.na(y)) &
!(is.na(y) & !is.na(lead(y)))) %>%
mutate(across(c(y,x), zoo::na.locf))
id t y x 1 1 1 1 0 2 1 2 2 0 3 2 1 1 1 4 2 2 2 1 5 2 3 3 1 6 2 4 3 1 7 2 5 3 1 8 2 6 3 1 9 2 7 3 1 10 3 1 1 0 11 3 2 1 0 12 4 1 1 1 13 4 2 2 1 14 4 3 2 1 15 4 5 3 1 16 4 6 3 1 17 4 7 3 1 18 5 1 1 0 19 5 2 2 0 20 5 3 2 0 21 5 5 2 0 22 5 6 3 0 23 5 7 3 0
Upvotes: 0