Reputation: 1211
Background
I've got this R dataframe, d. It looks like this:
d <- data.frame(ID = c("a","a","a","a","a","a","a","b","b","b","b"),
                treatment = c(0,1,0,0,0,1,0,1,0,0,0),
                event = c(0,0,1,1,1,1,1,0,1,1,1),
                service_date = as.Date(c("2011-01-01",
                                         "2011-08-21",
                                         "2011-12-23",
                                         "2012-02-23",
                                         "2013-09-14",
                                         "2013-04-07",
                                         "2014-10-14",
                                         "2013-01-01",
                                         "2013-12-12",
                                         "2014-06-17",
                                         "2015-09-29")),
                stringsAsFactors=FALSE)
It's got two people in it (ID a and b) and some information about whether they received a treatment, whether they had an event, and a service_date for when either of those things happens.
The problem & what I'm looking for
My goal is to figure out how many event==1's people have on average in their n-th year after their first treatment==1. Here's the result I'd want, and how I would do it by hand for the first year after treatment:
1. For each ID, find the first service_date where treatment equals 1. For ID=a, that's 2011-08-21.
2. For that "date of first treatment", count forwards 365 days. For ID=a, that'd be 2012-08-21. This gives you an interval for "first year after first treatment".
3. Within that interval, count/tally how many times event==1. For ID=a's first year (so between 2011-08-21 and 2012-08-21), that's 2 times: once on 2011-12-23 and another on 2012-02-23.
4. Repeat steps 1, 2, and 3 for the other IDs (in this example it's only b) and get their count. For ID=b, this would only be one event: between 2013-01-01 and one year later on 2014-01-01, they only have one event, on 2013-12-12.
5. Sum the counts and divide by the number of IDs to get an average. Here, that'd be (2 events + 1 event) / 2 people == 1.5 events, on average, in Year 1 after first treatment.
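The steps above can be checked by hand with a few lines of base R. This is only a sketch of the manual procedure, not a proposed solution: the names first_tx, window_end and n_events are made up for illustration, both ends of the window are assumed to be inclusive, and the data frame is rebuilt so the block runs on its own.

```r
# Rebuild the example data so this block is self-contained
d <- data.frame(
  ID = c("a","a","a","a","a","a","a","b","b","b","b"),
  treatment = c(0,1,0,0,0,1,0,1,0,0,0),
  event = c(0,0,1,1,1,1,1,0,1,1,1),
  service_date = as.Date(c("2011-01-01", "2011-08-21", "2011-12-23",
                           "2012-02-23", "2013-09-14", "2013-04-07",
                           "2014-10-14", "2013-01-01", "2013-12-12",
                           "2014-06-17", "2015-09-29")),
  stringsAsFactors = FALSE
)

# Steps 1 to 4: per ID, find the first treatment date, add 365 days,
# and count the events inside that window (both ends inclusive: an assumption)
n_events <- sapply(split(d, d$ID), function(g) {
  first_tx   <- min(g$service_date[g$treatment == 1])
  window_end <- first_tx + 365
  sum(g$event == 1 &
        g$service_date >= first_tx &
        g$service_date <= window_end)
})

# Step 5: average the per-ID counts
n_events
mean(n_events)
```

Note that as.Date("2011-08-21") + 365 evaluates to 2012-08-20, one day short of the calendar anniversary, because 2012 is a leap year. The counts in this example are unaffected.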
So in other words it's a calculation that should spit out a single number:
> d %>% ... etc etc ...
# A tibble: 1 x 1
mean
<dbl>
1 1.5
Ideally I'd like to be able to modify the code to define a different interval after first treatment. Like year 2 could be "the time between first treatment+365 and first treatment+730".
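One way to express "year n after first treatment" is as a pair of day offsets from the first treatment date. A minimal sketch, assuming a year is treated as exactly 365 days as in the description above; year_offsets is a hypothetical helper name, not part of any package:

```r
# Hypothetical helper: day offsets (from first treatment) that bound year n
year_offsets <- function(n, days_per_year = 365) {
  c(start = (n - 1) * days_per_year, end = n * days_per_year)
}

year_offsets(1)  # offsets for the first year
year_offsets(2)  # offsets for the second year
```

Adding these offsets to the first treatment date gives the start and end of the window for that year.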
What I've tried
I'm messing with some R code to try and do this. Conceptually, my approach consists of the following:
First, mutate a new column year_interval using the difftime function to define the interval in which R should be counting events for each ID.
Next, mutate another column interval_event_count that does the actual counting.
Finish the operation using mean.
This is probably not the only valid approach, of course (it may not even be valid at all 🙂).
So far, I've got this going, but it's giving me an error about difftime:
d <- d %>%
group_by(ID) %>%
arrange(service_date) %>%
mutate(
year_interval = difftime(min(treatment==1), min(treatment==1)+365, units = "days"),
interval_event_count = tally(year_interval)) %>%
ungroup() %>%
mean(interval_event_count)
Error in `mutate_cols()`:
! Problem with `mutate()` column `year_interval`.
i `year_interval = difftime(min(treatment == 1), min(treatment == 1) + 365, units = "days")`.
x 'origin' must be supplied
i The error occurred in group 1: ID = "a".
Caused by error in `as.POSIXct.numeric()`:
! 'origin' must be supplied
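A likely reading of this error, offered as a sketch rather than a full diagnosis: treatment == 1 is a logical vector, so min(treatment == 1) returns a 0/1 number rather than a date, and difftime() then tries to convert that number to a date-time. Indexing service_date by the treatment flag gives a Date instead. The values below are copied from ID a in the example data; first_tx is a name made up for illustration.

```r
treatment <- c(0, 1, 0, 0, 0, 1, 0)
service_date <- as.Date(c("2011-01-01", "2011-08-21", "2011-12-23",
                          "2012-02-23", "2013-09-14", "2013-04-07",
                          "2014-10-14"))

# min() of a logical vector is a number, not a date
min(treatment == 1)

# Subsetting the dates first yields the first treatment date as a Date
first_tx <- min(service_date[treatment == 1])
first_tx

# Date arithmetic then works directly, in days
first_tx + 365
```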
Upvotes: 1
Views: 540
Reputation: 24832
Perhaps just build a small function that does the calculation, and also takes params s and e:
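# tx: treatment flag, ev: event flag, d: service dates
# s, e: start and end of the window, in days after the first treatment
f <- function(tx,ev,d,s=0,e=365) {
  tx1 = min(d[tx==1])                              # date of first treatment
  interval = c(tx1+s,tx1+e)                        # window to count in
  sum(ev[which(d>=interval[1] & d<=interval[2])])  # events inside the window
}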
Usage:
d %>% group_by(ID) %>%
summarize(ev = f(treatment, event, service_date)) %>%
summarize(result = mean(ev))
Output:
# A tibble: 1 x 1
result
<dbl>
1 1.5
If you want a different interval, just override the defaults for s and e, like this:
d %>% group_by(ID) %>%
summarize(ev = f(treatment, event, service_date,s=365, e=730)) %>%
summarize(result = mean(ev))
Even better, make a wrapper function, say get_events, like this:
get_events <- function(dt,s=0, e=365) {
group_by(dt,ID) %>%
summarize(ev = f(treatment, event, service_date, s=s, e=e)) %>%
summarize(result = mean(ev))
}
and call it like this:
get_events(d)
get_events(d,365,730)
get_events(d,e=730)
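One detail worth checking in f: both ends of the interval are inclusive, so an event falling exactly on a boundary day (first treatment + 365) would be counted in year 1 and again in year 2. If that matters, a half-open window is one option. A sketch, where f_half_open is a hypothetical variant of f and the small vectors are made-up test data:

```r
# Hypothetical variant of f: include the start day, exclude the end day
f_half_open <- function(tx, ev, d, s = 0, e = 365) {
  tx1 <- min(d[tx == 1])               # date of first treatment
  sum(ev[d >= tx1 + s & d < tx1 + e])  # events in [tx1 + s, tx1 + e)
}

# Made-up data: one event exactly 365 days after treatment, one at day 400
tx    <- c(1, 0, 0)
ev    <- c(0, 1, 1)
dates <- as.Date("2013-01-01") + c(0, 365, 400)

year1 <- f_half_open(tx, ev, dates)
year2 <- f_half_open(tx, ev, dates, s = 365, e = 730)
year1
year2
```

With this variant the event on the boundary day is attributed to year 2 only.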
Of course, if you are looking for speed, don't use group_by()/summarize(). Instead, convert d to a data.table and run it like this:
library(data.table)
setDT(d)[, f(treatment,event,service_date), by=ID][, mean(V1)]
Output:
[1] 1.5
Upvotes: 1
Reputation: 887601
Here is one option with dplyr: grouped by 'ID', get the index of the first occurrence of 1 in 'treatment' with match to get the 'service_date_min', and add 365 to return the 'service_date_max'. Then flag the rows where 'service_date' falls inside that window and 'event' is 1, get the sum of those flags per 'ID' (for a logical vector, sum returns the count of TRUE values), and finally get the mean of 'n'.
library(dplyr)
d %>%
arrange(ID, service_date) %>%
group_by(ID) %>%
filter(cumsum(treatment == 1) > 0) %>%
mutate(service_date_min = service_date[match(1, treatment)],
service_date_max = service_date_min + 365 +1,
i1 = service_date > service_date_min &
service_date < service_date_max & event == 1) %>%
summarise(n = sum(i1)) %>%
summarise(mean = mean(n))
-output
# A tibble: 1 × 1
mean
<dbl>
1 1.5
Upvotes: 2
Reputation: 79174
Here is a dplyr approach that follows, step by step, what you described in your question:
d %>%
group_by(ID) %>%
mutate(x = first(service_date[treatment==1]),
y = first(service_date[treatment==1])+365+1
) %>%
rowwise() %>%
mutate(z = ifelse(between(service_date, x, y), 1, 0)) %>%
group_by(ID) %>%
summarise(count = sum(z[event==1])) %>%
summarise(mean = mean(count))
mean
<dbl>
1 1.5
Upvotes: 2