ceefel
ceefel

Reputation: 153

Identify gaps in time data

EDIT:

I found a way for the problem below, however, it works on a small dataset but still creates falses output on large datasets. Someone knows why? I can't find the mistake. Here's the code:

df$continuous <-
  unlist(lapply(split(df, df$ID),
                function(x) {
                  sapply(1:nrow(x),
                         function(y) {
                           any(x$start[y] - x$end[-(y:NROW(x$end))] <= 1)
                         })
                }))

ORIGINAL PROBLEM: I'm working on a function to identify a gap in a series of start/end dates. The output should be FALSE if a start date begins later than 1 day after any of the previous end dates.

DATA:

df <- data.frame('ID' = c('1','1','1','1','1','1'), 'start' = as.Date(c('2010-01-01', '2010-01-03', '2010-01-05', '2010-01-09','2010-02-01', '2010-02-10')),
                 'end' = as.Date(c('2010-01-03', '2010-01-22', '2010-01-07', '2010-01-12', '2010-02-10', '2010-02-12')))

This is my attempt to solve this with x = start and y = end:

my_fun <- function(x,y){
  any(x[i] - y[1:NROW(i)-1] <= 1)
}

It works well if I specify i but I don't manage to wrap this into a loop. Ultimately, this function should be applied to groups in a large dataset in a dplyr manner.

This is what it should look like:

  ID      start        end  continuous
1  1 2010-01-01 2010-01-03 FALSE #or TRUE
2  1 2010-01-03 2010-01-22 TRUE
3  1 2010-01-05 2010-01-07 TRUE
4  1 2010-01-09 2010-01-12 TRUE
5  1 2010-02-01 2010-02-10 FALSE
6  1 2010-02-10 2010-02-12 TRUE #according to my function or FALSE compared to start[1] would be even better

I'd very much appreciate some help with this.

Upvotes: 2

Views: 2860

Answers (2)

zwep
zwep

Reputation: 1340

EDIT: I just re-read your question... and you wanted to compare the start with all the end dates. Then try to use the function expand.grid

df <- data.frame('ID' = c('1','1','1','1','1','1'), 'start' = as.Date(c('2010-01-01', '2010-01-03', '2010-01-05', '2010-01-09','2010-02-01', '2010-02-10')),
                 'end' = as.Date(c('2010-01-03', '2010-01-22', '2010-01-07', '2010-01-12', '2010-02-10', '2010-02-12')))

df_expand <- as.data.table(expand.grid(df$start,df$end))

This creates all the combinations between start/end

names(df_expand) = c("start","end")
df_expand[,ID:= 1]
df_expand[,diff_days := difftime(end,start,unit = "days")]
df_expand[,acceptable := any(diff_days<1), by = start]

Here we define what is acceptable..

df = merge(df,df_expand,by = c("start","end","ID"),all.x=TRUE)

I thought.. we might want to keep the original data, so we merge it back and shrink the information that we have.

Wait.. can you provide a proper example of what you expect and what is needed? I just read your other comment and now it confuses me...

Upvotes: 1

Andrew Brēza
Andrew Brēza

Reputation: 8317

You can do this using dplyr and lubridate. dplyr has really useful window functions like lag() that are handy for this type of analysis.

library(tidyverse)
library(lubridate)

df %>% 
  mutate(start - lag(end, 1) == 0)

# ID      start        end start - lag(end, 1) == 0
# 1  1 2010-01-01 2010-01-03                       NA
# 2  1 2010-01-03 2010-01-22                     TRUE
# 3  1 2010-01-05 2010-01-07                    FALSE
# 4  1 2010-01-09 2010-01-12                    FALSE
# 5  1 2010-02-01 2010-02-10                    FALSE
# 6  1 2010-02-10 2010-02-12                     TRUE

How do you want to handle the first row of your data? Since there is no previous value, it shows NA. This is generally how you should handle situations like this but I can edit my answer if you'd like it to have a different value.

Upvotes: 3

Related Questions