John

Reputation: 43279

Identify gaps in a continuous time period

I have a data frame with observations of when lines were attached to IDs. I need the period of time, in days, during which each ID had a line/catheter attached.

Here is my dput output:

structure(list(ID = c(487622L, 487622L, 487639L, 487639L, 489027L, 
489027L, 489027L, 491858L, 491858L, 491858L, 491858L, 491858L, 
491858L), Line = c("Central Venous Line", "Central Venous Line", 
"Central Venous Line", "Peripherally Inserted Central Catheter (PICC)", 
"Haemodialysis Catheter", "Peripherally Inserted Central Catheter (PICC)", 
"Haemodialysis Catheter", "Central Venous Line", "Haemodialysis Catheter", 
"Central Venous Line", "Haemodialysis Catheter", "Central Venous Line", 
"Peripherally Inserted Central Catheter (PICC)"), Start = structure(c(1362528000, 
1363219200, 1362268800, 1363219200, 1364774400, 1365120000, 1365465600, 
1364688000, 1364688000, 1365724800, 1365724800, 1366848000, 1369353600
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), End = structure(c(1362787200, 
1363824000, 1363305600, 1363737600, 1365465600, 1366675200, 1365638400, 
1365724800, 1365724800, 1366329600, 1366848000, 1367539200, 1369612800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Days = c("3.095138889", 
"7.045138889", "11.87777778", "5.736111111", "7.850694444", "18.02083333", 
"1.813888889", "12.32986111", "12.71388889", "6.782638889", "13.14027778", 
"7.718055556", "3.397222222"), dateOrder = c(1L, 2L, 1L, 2L, 
1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L)), .Names = c("ID", "Line", 
"Start", "End", "Days", "dateOrder"), row.names = 79:91, class = "data.frame")

Here is the catch. It does not matter if an ID has more than one line/catheter. I just need to take the earliest start date for each ID, the latest end date for each ID, and calculate the number of continuous days each ID has a line/catheter attached.

The problem is complicated by some cases, e.g. ID 491858. This individual had a line removed (dateOrder = 5) on 2013-05-03 and another inserted on 2013-05-24 for just over 3 days.

I intended to handle this by subtracting the gap (in days) from the total number of days between min(Start) and max(End).
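As a rough illustration of that arithmetic for ID 491858, here is a minimal sketch. The dates are read off the Start and End columns of the dput above; the variable names are only illustrative.

```r
# Dates for ID 491858, taken from the Start/End columns in the question (UTC)
first_start <- as.POSIXct("2013-03-31", tz = "UTC")  # earliest Start
last_end    <- as.POSIXct("2013-05-27", tz = "UTC")  # latest End
gap_start   <- as.POSIXct("2013-05-03", tz = "UTC")  # last line removed
gap_end     <- as.POSIXct("2013-05-24", tz = "UTC")  # next line inserted

# Whole span from first insertion to last removal, in days
span <- as.numeric(difftime(last_end, first_start, units = "days"))

# Days with no line attached
gap <- as.numeric(difftime(gap_end, gap_start, units = "days"))

attached_days <- span - gap
attached_days
```

Here the span is 57 days and the gap is 21 days, leaving 36 days with a line attached.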

There are over 20,000 records in the data set.

Here is what I have done so far:

I converted the data frame to a list of data frames split by ID, and intended to apply a function like the following to each one:

If the difference in time (days) between subsequent start date and previous end date for each row exceeds 0, then add TRUE or some arbitrary column value to each data frame.

function(y){
    y$test <- FALSE
    # compare each row's Start with the previous row's End
    for (i in seq_len(nrow(y) - 1)){
        if(difftime(y$Start[i+1], y$End[i], units='days') > 0){
            y$test[i+1] <- TRUE}
        }
    y
    }
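The same row-by-row check can probably be written without an explicit loop by comparing shifted vectors. The sketch below rebuilds two of the IDs from the epoch seconds in the dput; `flag_gap` and the `gap_before` column are hypothetical names, and rows are assumed to be comparable only after sorting by Start.

```r
# Two IDs from the question's data, rebuilt from the epoch seconds in the dput
DF <- data.frame(
  ID    = c(487622L, 487622L, 487639L, 487639L),
  Start = as.POSIXct(c(1362528000, 1363219200, 1362268800, 1363219200),
                     origin = "1970-01-01", tz = "UTC"),
  End   = as.POSIXct(c(1362787200, 1363824000, 1363305600, 1363737600),
                     origin = "1970-01-01", tz = "UTC")
)

# Hypothetical helper: TRUE on rows that start after the previous row ended
flag_gap <- function(y) {
  y <- y[order(y$Start), ]
  n <- nrow(y)
  # first row has no predecessor, so it can never follow a gap
  y$gap_before <- c(FALSE, y$Start[-1] > y$End[-n])
  y
}

# Apply per ID and stack the pieces back into one data frame
flagged <- do.call(rbind, lapply(split(DF, DF$ID), flag_gap))
flagged$gap_before
```

Only the second row of ID 487622 is flagged, because its line starts five days after the previous one ended; the two lines of ID 487639 overlap.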

Any help would be greatly appreciated.

Thanks.

UPDATE

Ignore the days column. It is of no use. I intend to aggregate month line counts from the unique cases.

Upvotes: 1

Views: 940

Answers (2)

alexis_laz

Reputation: 13122

I guess something like this might help, unless I've misunderstood something:

unlist(lapply(split(DF, DF$ID), 
  function(x) { totaldays <- difftime(max(x$End), min(x$Start), units = "days");
   x$Start <- c(x$Start[-1], NA);
   res <- difftime(x$Start[-length(x$Start)], x$End[-length(x$Start)], units = "days");
   res <- res[res > 0];
   res <- sum(as.numeric(res));   # sum every gap; 0 when there are none
   return(as.numeric(totaldays) - res) }))
#487622 487639 489027 491858 
#    10     17     22     36 

DF is the data frame from your dput, with the rows of each ID sorted by Start.
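If rows might be unsorted, or a short line might sit entirely inside a longer one, comparing each start with the running maximum of the earlier ends is one way to make the gap calculation more robust. This is a sketch under those assumptions rather than a tested solution for the full 20,000-row data set; `covered_days` is a hypothetical helper name.

```r
# ID, Start and End rebuilt from the dput in the question
DF <- data.frame(
  ID    = rep(c(487622L, 487639L, 489027L, 491858L), times = c(2, 2, 3, 6)),
  Start = as.POSIXct(c(1362528000, 1363219200, 1362268800, 1363219200,
                       1364774400, 1365120000, 1365465600, 1364688000,
                       1364688000, 1365724800, 1365724800, 1366848000,
                       1369353600), origin = "1970-01-01", tz = "UTC"),
  End   = as.POSIXct(c(1362787200, 1363824000, 1363305600, 1363737600,
                       1365465600, 1366675200, 1365638400, 1365724800,
                       1365724800, 1366329600, 1366848000, 1367539200,
                       1369612800), origin = "1970-01-01", tz = "UTC")
)

# Hypothetical helper: days covered by the union of the [start, end] intervals
covered_days <- function(start, end) {
  o <- order(start)
  s <- as.numeric(start)[o]   # seconds since the epoch
  e <- as.numeric(end)[o]
  n <- length(s)
  span <- max(e) - min(s)
  if (n < 2) return(span / 86400)
  # latest end seen before each row; a gap exists where the next start exceeds it
  prev_end <- cummax(e)[-n]
  gaps <- pmax(s[-1] - prev_end, 0)
  (span - sum(gaps)) / 86400
}

res <- sapply(split(DF, DF$ID), function(x) covered_days(x$Start, x$End))
res
```

On the sample data this gives 10, 17, 22 and 36 days for the four IDs.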

Upvotes: 1

Stedy

Reputation: 7469

If I understand correctly, you want the total number of days that the catheter was present. To do that, I would use plyr:

#assume df is your dput object

library(plyr)
day.summary <- ddply(df, "ID", function(x) data.frame(total.days = sum(as.numeric(x$Days))))
print(day.summary)
      ID total.days
1 487622   10.14028
2 487639   17.61389
3 489027   27.68542
4 491858   56.08194

Upvotes: 0
