Ell1954
Ell1954

Reputation: 3

How to find whether multiple years are within patient follow-up R

Please feel free to remove question if it's technically a duplicate, but I've looked at a lot of similar answers and none of them work for my data.

I have patient follow-up data, like so:

ID   start.date   end.date
1    1999-03-02   2003-06-15
2    1995-11-23   2007-09-26
..
.. 
n    2007-02-19   2010-08-06

This is very simplified, I have over 4 million ids.

I'm trying to find how many ids were registered each year from 1990 to 2016 in order to calculate an incidence rate for each year (disease status column omitted). I'd like a dataset like the following:

ID   start.date   end.date    y1990 ... y1995 ..  y2000 ..  y2005 ..  y2016
1    1999-03-02   2003-06-15    0         0         1          0        0
2    1990-11-23   2007-09-26    1         1         1          1        0
..
.. 
n    2005-02-19   2016-08-06    0         0         0          1        1

Each column has the value of 1 if that patient is still "registered" in that year, and 0 if they're not.

As a side note if anyone knows of a package in R that can calculate stratified incidence, that would be better, but as it stands I can't get any of them to do what I want them to.

I've tried various solutions for data.table, lubridate, and dplyr all to no avail. Help would be much appreciated.

Upvotes: 0

Views: 119

Answers (3)

chinsoon12
chinsoon12

Reputation: 25225

Here is another option using package:

library(data.table)
dat <- fread("ID   start.date   end.date
0    1990-11-23   2007-09-26
1    1999-03-02   2003-06-15
2    1995-11-23   2007-09-26
3    2007-02-19   2010-08-06
4    2005-02-19   2016-08-06")

#convert columns to Date class
cols <- names(dat)[-1L]
dat[, (cols) := lapply(.SD, as.Date, format="%Y-%m-%d"), .SDcols=cols]

#get start and end years
dat[, ':=' (startyear=year(start.date), endyear=year(end.date))]

#create a table of sequencing years to be used for joining
period <- data.table(yr=1990:2016, YEAR=1990:2016)

dcast(
    #perform a non-equi join between years sequence and dataset
    period[dat, on=.(yr >= startyear, yr <= endyear)], 
    #pivot results according to OP's request
    ID + start.date + end.date ~ YEAR, 
    length, 
    value.var="YEAR"
) 

output:

   ID start.date   end.date 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
1:  0 1990-11-23 2007-09-26    1    1    1    1    1    1    1    1    1    1    1
2:  1 1999-03-02 2003-06-15    0    0    0    0    0    0    0    0    0    1    1
3:  2 1995-11-23 2007-09-26    0    0    0    0    0    1    1    1    1    1    1
4:  3 2007-02-19 2010-08-06    0    0    0    0    0    0    0    0    0    0    0
5:  4 2005-02-19 2016-08-06    0    0    0    0    0    0    0    0    0    0    0
   2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
1:    1    1    1    1    1    1    1    0    0    0    0    0    0    0    0    0
2:    1    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0
3:    1    1    1    1    1    1    1    0    0    0    0    0    0    0    0    0
4:    0    0    0    0    0    0    1    1    1    1    0    0    0    0    0    0
5:    0    0    0    0    1    1    1    1    1    1    1    1    1    1    1    1

Upvotes: 0

AndS.
AndS.

Reputation: 8120

Another option you could try:

library(tidyverse)
library(lubridate)

data_frame(year = rep(1999:2009, each = nrow(df)), ID = rep(df$ID, 2009-1998)) %>%
    left_join(df, ., by = "ID") %>% 
    mutate(int = interval(parse_date_time(substring(start.date,1,4), orders = "y"), parse_date_time(substring(end.date,1,4), orders = "y"))) %>%
    mutate(val = ifelse(parse_date_time(year, orders = "y") %within% int, 1, 0)) %>% 
    spread(year, val) %>% 
    rename_at(vars(`1999`:`2009`), funs(paste0("y", .)))
#   ID start.date   end.date                            int y1999 y2000 y2001 y2002 y2003 y2004 y2005 y2006 y2007 y2008 y2009
# 1  1 1999-03-02 2003-06-15 1999-03-02 UTC--2003-06-15 UTC     1     1     1     1     1     0     0     0     0     0     0
# 2  2 1995-11-23 2007-09-26 1995-11-23 UTC--2007-09-26 UTC     1     1     1     1     1     1     1     1     1     0     0
# 3  n 2007-02-19 2010-08-06 2007-02-19 UTC--2010-08-06 UTC     0     0     0     0     0     0     0     0     1     1     1

This sets up a time interval and evaluates if the year is within that time interval. Also note that I only set up the code to go from 1999 to 2009 for convenience here.

Upvotes: 1

Maurits Evers
Maurits Evers

Reputation: 50718

You could do something like this:

library(tidyverse)
df %>%
    mutate(year = as.numeric(sub("-\\d+-\\d+$", "", start.date))) %>%
    group_by(ID) %>%
    mutate(n = 1:n()) %>%
    spread(year, n, fill = 0)
## A tibble: 3 x 6
## Groups:   ID [3]
#  ID    start.date end.date   `1995` `1999` `2007`
#  <fct> <fct>      <fct>       <dbl>  <dbl>  <dbl>
#1 1     1999-03-02 2003-06-15     0.     1.     0.
#2 2     1995-11-23 2007-09-26     1.     0.     0.
#3 n     2007-02-19 2010-08-06     0.     0.     1.

Sample data

df <- read.table(text =
    "ID   start.date   end.date
1    1999-03-02   2003-06-15
2    1995-11-23   2007-09-26
n    2007-02-19   2010-08-06", header = T)

Upvotes: 1

Related Questions