Background I've got this R dataframe, d . It looks like this: d <- data.frame(ID = c("a","a","a","a","a","a","a","b","b","b","b"), treatment = c(0,1,0,0,0,1,0,1,0,0,0), event = c(0,0,1,1,1,1,1,0,1,1,1), service_date = as.Date(c("2011-01-01", "2011-08-21", "2011-12-23", "2012-02-23", "2013-09-14", "2013-04-07", "2014-10-14", "2013-01-01", "2013-12-12", "2014-06-17", "2015-09-29")), stringsAsFactors=FALSE) It's got two people in it ( ID a and b ) and some information about whether they received a treatment , whether they had an event , and a service_date for when either of those things happens. The problem & what I'm looking for My goal is to figure out how many event==1 's people have on average in their n -th year after their first treatment==1 . Here's the result I'd want, and how I would do it by hand for the first year after treatment: For each ID , find the first service_date where treatment equals 1 . For ID =a, that's 2011-08-21 . For that "date of first treatment ", count forwards 365 days. For ID =a, that'd be 2012-08-21 . This gives you an interval for "first year after first treatment ". Within that interval, count/tally how many times event==1 . For ID =a's first year (so between 2011-08-21 and 2012-08-21 ), that's 2 times: once on 2011-12-23 and another on 2012-02-23 . Repeat steps 1, 2, and 3 for the other ID 's (in this example it's only b) and get their count. For For ID =b', this would only be one event: between 2013-01-01 and one year later on 2014-01-01 , they only have one event, on 2013-12-12 . Sum the counts and divide by number of ID 's to get an average. Here, that'd be (2 events + 1 event) / 2 people == 1.5 events, on average, in Year 1 after first treatment So in other words it's a calculation that should spit out a single number: > d %>% ... etc etc ... # A tibble: 1 x 1 mean <dbl> 1 1.5 Ideally I'd like to be able to modify the code to define a different interval after first treatment . Like year 2 could be "the time between first treatment +365 and first treatment +730". What I've tried I'm messing with some R code to try and do this. Conceptually, my approach consists of the following: First, to mutate a new column year_interval using the difftime function to define the interval in which R should be counting events for each ID . Next, to mutate another column interval_event_count that does the actual counting. Finish the operation using mean . This is probably not the only valid approach, of course (it may not even be valid at all 🙂). So far, I've got this going, but it's giving me an error about difftime : d <- d %>% group_by(ID) %>% arrange(service_date) %>% mutate( year_interval = difftime(min(treatment==1), min(treatment==1)+365, units = "days"), interval_event_count = tally(year_interval)) %>% ungroup() %>% mean(interval_event_count) Error in `mutate_cols()`: ! Problem with `mutate()` column `year_interval`. i `year_interval = difftime(min(treatment == 1), min(treatment == 1) + 365, units = "days")`. x 'origin' must be supplied i The error occurred in group 1: ID = "a". Caused by error in `as.POSIXct.numeric()`: ! 'origin' must be supplied

Reputation: 1211

In R, what is the average frequency (count) of events per ID in Year N?

Background

I've got this R dataframe, d. It looks like this:

d <- data.frame(ID = c("a","a","a","a","a","a","a","b","b","b","b"),
                treatment = c(0,1,0,0,0,1,0,1,0,0,0),
                event = c(0,0,1,1,1,1,1,0,1,1,1),
                service_date = as.Date(c("2011-01-01",   
                                         "2011-08-21",   
                                         "2011-12-23",   
                                         "2012-02-23",   
                                         "2013-09-14",   
                                         "2013-04-07",   
                                         "2014-10-14",   
                                         "2013-01-01",
                                         "2013-12-12",   
                                         "2014-06-17",
                                         "2015-09-29")), 
                stringsAsFactors=FALSE)

It's got two people in it (ID a and b) and some information about whether they received a treatment, whether they had an event, and a service_date for when either of those things happens.

The problem & what I'm looking for

My goal is to figure out how many event==1's people have on average in their n-th year after their first treatment==1. Here's the result I'd want, and how I would do it by hand for the first year after treatment:

For each ID, find the first service_date where treatment equals 1. For ID=a, that's 2011-08-21.
For that "date of first treatment", count forwards 365 days. For ID=a, that'd be 2012-08-21. This gives you an interval for "first year after first treatment".
Within that interval, count/tally how many times event==1. For ID=a's first year (so between 2011-08-21 and 2012-08-21), that's 2 times: once on 2011-12-23 and another on 2012-02-23.
Repeat steps 1, 2, and 3 for the other ID's (in this example it's only b) and get their count. For For ID=b', this would only be one event: between 2013-01-01 and one year later on 2014-01-01, they only have one event, on 2013-12-12.
Sum the counts and divide by number of ID's to get an average. Here, that'd be (2 events + 1 event) / 2 people == 1.5 events, on average, in Year 1 after first treatment

So in other words it's a calculation that should spit out a single number:

> d %>% ... etc etc ... 

# A tibble: 1 x 1
   mean
  <dbl>
1   1.5

Ideally I'd like to be able to modify the code to define a different interval after first treatment. Like year 2 could be "the time between first treatment+365 and first treatment+730".

What I've tried

I'm messing with some R code to try and do this. Conceptually, my approach consists of the following:

First, to mutate a new column year_interval using the difftime function to define the interval in which R should be counting events for each ID.
Next, to mutate another column interval_event_count that does the actual counting.
Finish the operation using mean.

This is probably not the only valid approach, of course (it may not even be valid at all 🙂).

So far, I've got this going, but it's giving me an error about difftime:

d <- d %>%
  group_by(ID) %>%
  arrange(service_date) %>%
  mutate(
    year_interval = difftime(min(treatment==1), min(treatment==1)+365, units = "days"),
    interval_event_count = tally(year_interval)) %>%
  ungroup() %>%
  mean(interval_event_count)

Error in `mutate_cols()`:
! Problem with `mutate()` column `year_interval`.
i `year_interval = difftime(min(treatment == 1), min(treatment == 1) + 365, units = "days")`.
x 'origin' must be supplied
i The error occurred in group 1: ID = "a".
Caused by error in `as.POSIXct.numeric()`:
! 'origin' must be supplied

Upvotes: 1

Answers (3)

langtang

Reputation: 24832

Perhaps just build a small function that does the calculation, and also takes params s and e

f <- function(tx,ev,d,s=0,e=365) {
  tx1 = min(d[tx==1])
  interval = c(tx1+s,tx1+e)
  sum(ev[which(d>=interval[1] & d<=interval[2])])
}

Usage:

d %>% group_by(ID) %>%
  summarize(ev = f(treatment, event, service_date)) %>% 
  summarize(result = mean(ev))

Output:

# A tibble: 1 x 1
  result
   <dbl>
1    1.5

If you want to get some other value, just change the default s and e, like this:

d %>% group_by(ID) %>%
  summarize(ev = f(treatment, event, service_date,s=365, e=730)) %>% 
  summarize(result = mean(ev))

Even better, make a wrapper function, say get_events, like this:

get_events <- function(dt,s=0, e=365) {
  group_by(dt,ID) %>%
    summarize(ev = f(treatment, event, service_date, s=s, e=e)) %>% 
    summarize(result = mean(ev))
}

and call it like this:

get_events(d)
get_events(d,365,730),
get_events(d,e=730)

Of course, if you a looking for speed, don't use group_by()/summarize(). Instead, set d to data.table, and run like this:

library(data.table)
setDT(d)[, f(treatment,event,service_date), by=ID][, mean(V1)]

Ouptut:

[1] 1.5

Upvotes: 1

akrun

Reputation: 887601

Here is one option with dplyr - grouped by 'ID' and 'service_date', get the index of the first occurrence of 1 in 'treatment' with match, to get the 'service_date_min', add 365 to return the 'service_date_max', then grouped by 'service_date_min' as well, get the sum of 'treatment' (if it is binary, sum returns the count of 1s), then get the mean of 'n' once we drop the last group i.e. service_date_min

library(dplyr)
d %>% 
  arrange(ID, service_date) %>%
  group_by(ID) %>% 
  filter(cumsum(treatment == 1) > 0) %>%
  mutate(service_date_min = service_date[match(1, treatment)], 
   service_date_max = service_date_min + 365 +1,
    i1 = service_date > service_date_min & 
    service_date < service_date_max & event == 1) %>%
   summarise(n = sum(i1)) %>%
   mutate(n = case_when(n ==1 ~ 1, TRUE ~ sum(n)/n))

-output

# A tibble: 2 × 2
  ID        n
  <chr> <dbl>
1 a       1.5
2 b       1

Upvotes: 2

TarJae

Reputation: 79174

Here is a dplyr doing step by step what you provided in your question:

d %>% 
  group_by(ID) %>% 
  mutate(x = first(service_date[treatment==1]),
         y = first(service_date[treatment==1])+365+1
         ) %>% 
  rowwise() %>% 
  mutate(z =  ifelse(between(service_date, x, y), 1, 0)) %>% 
  group_by(ID) %>% 
  summarise(count = (sum(z[event==1])+1)/2)

  ID    count
  <chr> <dbl>
1 a       1.5
2 b       1

Upvotes: 2

In R, what is the average frequency (count) of events per ID in Year N?

Answers (3)

Related Questions