Reputation: 493
A few days ago I opened this related thread: Time-interval overlap match by group
However, now I have to deal with the fact that I need to overlap multiple time-intervals columns and return the first row_number value of flag = 1 when this occurs.
For instance I have the following df:
id flag row_number time_1 time_2 result
1 1 1 2001-04-01 UTC--2001-05-01 UTC 1960-01-01 UTC--1962-01-01 UTC NA
1 1 2 2007-08-01 UTC--2007-12-01 UTC 1980-01-01 UTC--1982-01-01 UTC NA
1 1 3 2010-03-01 UTC--2011-03-01 UTC 1949-01-01 UTC--1951-01-01 UTC NA
1 0 4 2001-04-15 UTC--2001-04-20 UTC 1981-01-01 UTC--1983-01-01 UTC NA
1 0 5 2001-04-17 UTC--2001-05-15 UTC 1959-01-01 UTC--1961-01-01 UTC 1
1 0 6 2007-09-01 UTC--2007-12-01 UTC 1980-01-01 UTC--1983-01-01 UTC 2
1 0 7 2011-01-01 UTC--2011-03-05 UTC 1994-01-01 UTC--1996-01-01 UTC NA
1 0 8 2018-01-01 UTC--2017-12-01 UTC 1949-01-01 UTC--1951-01-01 UTC NA
Created with the following code:
library(dplyr)
library(purrr)
library(lubridate)
df <- data.frame(id=c(1, 1, 1, 1, 1, 1, 1, 1),
flag=c(1, 1, 1, 0, 0, 0, 0, 0),
row_number=c(1,2,3,4,5,6,7,8),
time_1=c(interval(ymd(20010401), ymd(20010501)),
interval(ymd(20070801), ymd(20071201)),
interval(ymd(20100301), ymd(20110301)),
interval(ymd(20010415), ymd(20010420)),
interval(ymd(20010417), ymd(20010515)),
interval(ymd(20070801), ymd(20071201)),
interval(ymd(20110101), ymd(20110305)),
interval(ymd(20180101), ymd(20171201))),
time_2=c(interval(ymd(19600101), ymd(19620101)),
interval(ymd(19800101), ymd(19820101)),
interval(ymd(19490101), ymd(19510101)),
interval(ymd(19810101), ymd(19830101)),
interval(ymd(19590101), ymd(19610101)),
interval(ymd(19800101), ymd(19820101)),
interval(ymd(19940101), ymd(19960101)),
interval(ymd(19490101), ymd(19510101))),
result = c(NA, NA, NA, NA, 1, 2, NA, NA))
This is, I need to find overlaps with time_1 and time_2 of rows with flag = 0 with all time_1 and time_2 variables of rows with flag = 1.
The result should be a column with the row_number value of first match between the row with flag 0 and a row with flag 1 that has an overlapping time_1 and time_2 intervals. For this purpose I have tried int_overlap() function from lubridate package.
With this code I am able to identify if there is time_1 overlap between one row with flag = 0 to any row with flag == 1, taking advantage of map_int() function
library(tidyverse)
library(lubridate)
df %>%
group_by(id) %>%
mutate(value = ifelse(flag == 0, map_int(time_1, ~ any(int_overlaps(.x, time_1[flag == 1]))), NA))
A related question which can be helpful: R Find overlap among time periods
EDIT: I want to obtain a column that identifies, with row_number variable, which is the first flag 1 row that have time_1 and time_2 overlapping values with flag 0 rows.
id flag row_number time_1 time_2 result
1 1 1 2001-04-01 UTC--2001-05-01 UTC 1960-01-01 UTC--1962-01-01 UTC NA
1 0 5 2001-04-17 UTC--2001-05-15 UTC 1959-01-01 UTC--1961-01-01 UTC 1
For instance row_number 1 and 5 fulfill the criteria. The result is an integer column which indicates that row_number 5 (flag 0 row) have time_1 and time_2 overlap with row_number 1 (flag 1).
Hope this clarifies.
Upvotes: 3
Views: 584
Reputation: 25225
Here is an option using data.table
by performing overlapping joins twice:
setkey(setDT(df), id, time_1_start, time_1_end)
ol1 <- foverlaps(df, df, nomatch=0L)[
row_number!=i.row_number & i.flag==0L & flag==1L,
.(id, irn=i.row_number, rn=row_number, flag=i.flag,
time_2_start=i.time_2_start, time_2_end=i.time_2_end)]
setkey(df, id, time_2_start, time_2_end)
setkey(ol1, id, time_2_start, time_2_end)
olaps <- foverlaps(ol1, df)[row_number!=irn & row_number==rn & i.flag==0L & flag==1L,
.(id, irn, xrn=row_number)]
df[olaps, on=.(id, row_number=irn), res := xrn]
setorder(df, row_number)
df
output:
id flag row_number time_1_start time_1_end time_2_start time_2_end res
1: 1 1 1 2001-04-01 2001-05-01 1960-01-01 1962-01-01 NA
2: 1 1 2 2007-08-01 2007-12-01 1980-01-01 1982-01-01 NA
3: 1 1 3 2010-03-01 2011-03-01 1949-01-01 1951-01-01 NA
4: 1 0 4 2001-04-15 2001-04-20 1981-01-01 1983-01-01 NA
5: 1 0 5 2001-04-17 2001-05-15 1959-01-01 1961-01-01 1
6: 1 0 6 2007-08-01 2007-12-01 1980-01-01 1982-01-01 2
7: 1 0 7 2011-01-01 2011-03-05 1994-01-01 1996-01-01 NA
8: 1 0 8 2017-12-01 2018-01-01 1949-01-01 1951-01-01 NA
data:
library(data.table)
dtfun <- function(x) as.IDate(x, format="%Y%m%d")
df <- data.frame(id=c(1, 1, 1, 1, 1, 1, 1, 1),
flag=c(1, 1, 1, 0, 0, 0, 0, 0),
row_number=c(1,2,3,4,5,6,7,8),
time_1_start=dtfun(c("20010401","20070801","20100301","20010415",
"20010417","20070801","20110101","20171201")),
time_1_end=dtfun(c("20010501","20071201","20110301","20010420","
20010515","20071201","20110305","20180101")),
time_2_start=dtfun(c("19600101","19800101","19490101","19810101",
"19590101","19800101","19940101","19490101")),
time_2_end=dtfun(c("19620101","19820101","19510101","19830101",
"19610101","19820101","19960101","19510101")))
Upvotes: 2
Reputation: 951
I'm pretty sure I dont understand exactly what you are looking for. In your data, time_1 and time_2 are very far apart and never intersect. Is that correct?
Maybe this will get the ball rolling. Is this what you want?
df %>%
mutate(test = case_when(
int_overlaps(time_1,time_2) & flag == 1 ~ T,
int_overlaps(time_1,time_2) & flag == 0 ~ F,
T ~ NA
))
Upvotes: 0