Reputation: 956
I have a cohort study in which participants have a date-of-birth and dates when they entered and exited the study.
I am trying to calculate the time-at-risk (i.e. duration in the study) by age, sex and year.
For example, a participant who enters the study in June 2005 age 40.8 (in decimal) and remains in the study for one year would contribute 0.2 years in 2005 age 40, 0.3 years in 2005 age 41, and 0.5 years in 2006 age 41.
The data looks like this:
N <- 1000
set.seed(50)
d <- data.frame(
sex = sample(c('m', 'f'), N, replace = T, prob = c(0.7, 0.3)),
dob = sample(seq(as.Date('1960/01/01'), as.Date('1985/01/01'), by="day"), N, replace = T),
study_entry = sample(seq(as.Date('2000/01/01'), as.Date('2010/01/01'), by="day"), N, replace = T)
)
d$study_exit <- d$study_entry + runif(N, 10, 2000)
d$age_entry <- as.numeric(d$study_entry - d$dob) / 365.25
And I am trying to create a summary table of follow-up duration that looks like this:
+--------+------+-----+--------------+
| year | sex | age | time at risk |
+--------+------+-----+--------------+
| 2000 | male | 20 | .... |
+--------+------+-----+--------------+
| 2000 | male | 21 | .... |
+--------+------+-----+--------------+
| 2000 | male | 22 | .... |
+--------+------+-----+--------------+
| etc... | ... | ... | ... |
+--------+------+-----+--------------+
How would you go about this?
Upvotes: 3
Views: 401
Reputation: 956
I've been using @Wimpel's solution, which works brilliantly - and then also realised that it can be slightly simplified by building a table of days in the study cohort directly:
library(data.table)
library(lubridate)
setDT(d)
d[, id := .I] # patient ID
dt <- d[, .(day = seq(from = study_entry, to = study_exit, by = 'day')), id]
dt <- d[, c('id', 'dob', 'sex')][dt, on = 'id']
dt[, actual_age := floor(as.numeric(day - dob) / 365.25)]
dt[, year := lubridate::year(day)]
dt[, .(days_at_risk = .N), c('actual_age', 'sex', 'year')]
Upvotes: 0
Reputation: 11898
Another general framework would be to use Lexis tools from the Epi package to split up each observation according to both the calendar time and age time scales:
library(tidyverse)
# Define time scales
lex <- Epi::Lexis(
entry = list(year = study_entry),
exit = list(year = study_exit, age = study_exit - dob),
data = mutate_if(d, lubridate::is.Date, Epi::cal.yr)
)
# Split observations into follow-up periods
lex_split <- lex %>%
Epi::splitLexis(time.scale = "age", breaks = 15:50) %>%
Epi::splitLexis(time.scale = "year", breaks = 2000 + 0:10)
lex_split %>%
mutate(
age = Epi::timeBand(., "age", type = "left"),
year = Epi::timeBand(., "year", type = "left")
) %>%
group_by(year, sex, age) %>%
summarise(years_at_risk = sum(lex.dur))
#> # A tibble: 558 x 4
#> # Groups: year, sex [22]
#> year sex age years_at_risk
#> <dbl> <fct> <dbl> <dbl>
#> 1 2000 f 15 0.523
#> 2 2000 f 16 0.947
#> 3 2000 f 17 0.152
#> 4 2000 f 18 0.286
#> 5 2000 f 19 1.10
#> 6 2000 f 20 0.521
#> 7 2000 f 21 1.01
#> 8 2000 f 22 0.396
#> 9 2000 f 23 0.344
#> 10 2000 f 24 0.677
#> # ... with 548 more rows
Created on 2019-10-29 by the reprex package (v0.3.0)
Upvotes: 2
Reputation: 27772
Here is an approach usig data.table
and lubridate
.
What it basically does, if to perform an overlap-join of your data on a lookup-table that exist of daily periods for the span of the study.
Then, for each 'match', the actual age of the participant on that day is calculated, based on the date-of-birth, using lubridate::as.period()
.
When all this is known, summarising is pretty easy (although I did not know what you exactly wanted to summarise, so I just summarised the total sum of days each group was at risk during thet year).
library(data.table)
library(lubridate)
#set d as data.table
setDT(d)
#over which years spans the study
ymin <- min( lubridate::year( d$study_entry ) )
ymax <- max( lubridate::year( d$study_exit ) )
#create lookup table of all days in study
dt.lookup <- data.table( from = seq( as.Date( paste0( ymin, "-01-01" ) ),
as.Date( paste0( ymax, "-12-31" ) ),
by = "day") )
dt.lookup[, to := from ]
#set keys
setkey( dt.lookup, from, to )
setkey( d, study_entry, study_exit )
#use foverlaps to join both data-sets
dt <- foverlaps( d, dt.lookup )
#now, we can calculate the age based on `from` and dob.
# for other options, see: https://stackoverflow.com/questions/27096485/change-a-column-from-birth-date-to-age-in-r
dt[,
actual_age := floor(
as.numeric(
lubridate::as.period( interval( dob, from ), unit = "years" ),
"years" ) ) ]
#since each row is the duration of 1 day (from-to),
# we can summarise using that knowledge
dt[, .( days_at_risk = .N ),
by = .( year = lubridate::year( from),
sex = sex,
age = actual_age ) ]
output
# year sex age days_at_risk
# 1: 2000 m 26 1459
# 2: 2000 m 27 848
# 3: 2001 m 27 2678
# 4: 2001 m 28 1752
# 5: 2000 f 24 248
# ---
# 723: 2013 m 44 242
# 724: 2013 m 40 146
# 725: 2014 m 41 82
# 726: 2015 m 42 8
# 727: 2015 m 43 58
Upvotes: 3