AdamO

Reputation: 4930

Easy way to calculate age, period, and cohort sample sizes in R

In prospective studies, you want to summarize how old your sample is, over which years they were observed, and how long they were observed altogether. Collectively, these are considered the age, period, and cohort time-scales of the sample.

The easiest way to illustrate this is with simulated data. Suppose the following data summarize a cohort of clinic patients, with their baseline ages and the start and stop dates of observation:

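set.seed(123)
n <- 10000
Obs <- data.frame(
  'age' = sample(seq(40, 80, by=5), n, replace=TRUE),  # baseline age in years
  # n0 holds the start dates as days since 1970-01-01 and is reused for 'end'
  'start' = as.Date(n0 <- runif(n, 10000, 12000), origin="1970-01-01"),
  'end' = as.Date(n0 + runif(n, 0, 3652.5), origin="1970-01-01")  # up to 10 years of follow-up
)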

I want a function that takes vectors of cut points such as

AgeCut <- c(0, 65, Inf)
Yrcut <- c(0, 2000, Inf)
DurCut <- c(0, 5, Inf)

and cross-tabulates the number of individuals who fall into each possible combination of those categories for at least one day. Or, more complicated still, the number of years each person spends in each category. For instance, a person who is 40 when they enter the sample in 1990 and stays in for 30 years would be in the yt65/bf2000/lt5year category for 5 years, then enter yt65/bf2000/gt5year and stay there for another 5 years, then enter yt65/af2000/gt5year for 15 more years, and finally spend the remaining 5 years in ot65/af2000/gt5year.
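To make that example concrete, here is a minimal base R sketch for that single person only. The names age0, year0, follow, and segments are made up for illustration and are not part of the simulated data; the idea is simply to find when each cut point is crossed on the time-since-entry scale and classify each resulting segment at its midpoint.

```r
# Hypothetical single-person example from the text:
# age 40 at entry, entry in 1990, 30 years of follow-up.
age0 <- 40
year0 <- 1990
follow <- 30

# Years since entry at which each cut point (age 65, year 2000, 5 years in study) is crossed
crossings <- c(65 - age0, 2000 - year0, 5)

# Segment boundaries on the time-since-entry scale, limited to the follow-up window
bounds <- sort(unique(c(0, crossings[crossings > 0 & crossings < follow], follow)))

# Classify each segment by the person's state at the segment midpoint
mid <- (head(bounds, -1) + tail(bounds, -1)) / 2
segments <- data.frame(
  AgeCut = cut(age0 + mid, c(0, 65, Inf), c("younger than 65", "65 and older"), right = FALSE),
  YrCut  = cut(year0 + mid, c(0, 2000, Inf), c("before 2000", "2000 and later"), right = FALSE),
  DurCut = cut(mid, c(0, 5, Inf), c("less than 5 years", "5 or more years"), right = FALSE),
  years  = diff(bounds)
)
print(segments)
```

The years column comes out as 5, 5, 15, and 5, matching the walk-through above. The full problem is doing this for every person at once and summing by category.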

For some reason, this is racking my brain so badly that I can't calculate the actual desired output, even with an inefficient for loop, but the format and structure would be something like:

        AgeCut             YrCut            DurCut  NumObs
1 younger than 65    before 2000 less than 5 years    1000
2    65 and older    before 2000 less than 5 years    1000   
3 younger than 65 2000 and later less than 5 years    1000
4    65 and older 2000 and later less than 5 years    1000
5 younger than 65    before 2000   5 or more years    1000
6    65 and older    before 2000   5 or more years    1000
7 younger than 65 2000 and later   5 or more years    1000
8    65 and older 2000 and later   5 or more years    1000

Upvotes: 1

Views: 258

Answers (2)

AdamO

Reputation: 4930

OK, I have this implementation in base R. It recursively evaluates the time spent in the current category until moving to the next one, adds that duration to the various counters, subtracts it from the overall duration of study participation, and then feeds the updated times and durations back into the apc function.

apc <- function(times, cuts, dur, strata=1) {
  ## times: list of current values on each time scale; cuts: matching list of cut points
  ## dur: remaining follow-up per person, in the same units as times
  class <- mapply(findInterval, times, cuts) ## current category on each time scale
  tnext <- mapply( ## times until next category
    function(t, c, i) {c[i+1] - t}, 
    times, cuts, as.data.frame(class)
  )
  mnext <- apply(tnext, 1, min, na.rm=TRUE) ## minimum time to next category
  mnext <- pmin(mnext, dur) ## truncate if duration exceeded before next
  dur <- dur-mnext
  times <- lapply(times, `+`, mnext)
  piece <- list(data.frame(class, 't'=mnext, strata)) ## time spent in the current categories
  if (all(dur == 0)) ## everyone has used up their follow-up: stop recursing
    return(piece)
  c(piece, apc(times, cuts, dur, strata=strata))
}

This gives the following number of person-years in each category:

> val
  age start cohort strata         t
1   1     1      1      1  3175.986
2   2     1      1      1  2582.793
3   1     2      1      1 17714.503
4   2     2      1      1 13972.134
5   1     2      2      1  5658.430
6   2     2      2      1  6957.702

The sum of t (50,061.55) is equal to the sum of Obs$end-Obs$start, in years.
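The step from the list that apc() returns to the val table is not shown. One plausible way to do it, which is an assumption and not necessarily what was actually run, is to stack the pieces with rbind and sum t within each combination of categories using aggregate. The sketch below uses a small hand-made list called pieces (hypothetical values) with the same columns that apc() produces, so it runs on its own:

```r
# Hand-made stand-in for the list returned by apc(): one data frame per
# recursion step, holding the category index on each time scale, the time
# spent there ('t'), and the 'strata' label. The values are illustrative only.
pieces <- list(
  data.frame(age = c(1, 2), start = c(1, 1), cohort = c(1, 1), t = c(2.5, 1.0), strata = 1),
  data.frame(age = c(1, 2), start = c(2, 2), cohort = c(1, 1), t = c(4.0, 3.0), strata = 1),
  data.frame(age = c(1, 2), start = c(2, 2), cohort = c(1, 2), t = c(0.5, 2.0), strata = 1)
)

# Stack all steps into one long data frame
long <- do.call(rbind, pieces)

# Total time in each observed combination of categories
val <- aggregate(t ~ age + start + cohort + strata, data = long, FUN = sum)
print(val)

# Sanity check: no person-time is lost or double counted by the aggregation
stopifnot(isTRUE(all.equal(sum(val$t), sum(long$t))))
```

aggregate only returns combinations that actually occur, which would explain why val above has six rows and not eight.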

Upvotes: 1

MrFlick

Reputation: 206411

Using some tidyverse functions, I think you want something like this:

library(tidyverse)
library(lubridate)  # provides year(); older tidyverse versions do not attach it
AgeCut <- c(0, 65, Inf)
Yrcut <- c(0, 2000, Inf)
DurCut <- c(0, 5, Inf)

Obs %>% transmute(
  ageCat = cut(age, AgeCut, c("younger than 65", "65 and older"), right=FALSE),
  startCat = cut(year(start), Yrcut, c("before 2000", "2000 and later"), right=FALSE),
  DurCut = cut(year(end)-year(start), DurCut, c("less than 5 years", "5 or more years"), right=FALSE)
) %>% table() %>% as_tibble()  # as_tibble() replaces the deprecated as_data_frame()

This returns

           ageCat       startCat            DurCut     n
            <chr>          <chr>             <chr> <int>
1 younger than 65    before 2000 less than 5 years  1196
2    65 and older    before 2000 less than 5 years   968
3 younger than 65 2000 and later less than 5 years  1312
4    65 and older 2000 and later less than 5 years  1015
5 younger than 65    before 2000   5 or more years  1503
6    65 and older    before 2000   5 or more years  1185
7 younger than 65 2000 and later   5 or more years  1580
8    65 and older 2000 and later   5 or more years  1241

The cut() function is doing most of the work here.
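As a small illustration of why right=FALSE matters for labels like "65 and older", here is a base R sketch with a few made-up ages (the vectors ages and labs are only for this example):

```r
# Illustrative ages sitting on and around the cut point at 65
ages <- c(40, 64, 65, 80)
AgeCut <- c(0, 65, Inf)
labs <- c("younger than 65", "65 and older")

# right = FALSE gives left-closed intervals: [0, 65) and [65, Inf)
left_closed <- cut(ages, AgeCut, labs, right = FALSE)

# The default (right = TRUE) gives (0, 65] and (65, Inf],
# so someone aged exactly 65 would be labelled "younger than 65"
right_closed <- cut(ages, AgeCut, labs)

print(data.frame(ages, left_closed, right_closed))
```

With the default setting the person aged exactly 65 lands in the wrong group, so the labels and the interval closure need to agree.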

Upvotes: 1
