Werther
Werther

Reputation: 133

Determine how many time intervals intersect with every given time interval (R)

I am working with these data on R. These are the first six rows —without counting the first column, which the write.csv function always adds—:

> head(my_data)
client_id contract_id contract_start contract_end inter_complex
        1           1     15/07/2019   15/07/2020  18092+18458i
        3           3      1/01/2015    1/01/2015  16436+16436i
        5           5     12/06/2020   12/06/2020  18425+18425i
       13          13      1/01/2015    1/01/2015  16436+16436i
       18          18      1/01/2015    1/01/2015  16436+16436i
       19          19      1/01/2015    1/01/2015  16436+16436i

Each row represents a different contract. The variable inter_complex is a complex number whose real part is the numeric representation of the date when a contract started, and whose imaginary part analogously represents the date when a contract ended. In case you're wondering, you can obtain that column by executing this:

library(tidyverse)
library(lubridate)

chars_2_cplex = function(start, end) {
    cbind(start, end) %>%
    apply(2, compose(as.numeric, dmy)) %*% rbind(1, 1i)
}

my_data %>% transmute(inter_complex = chars_2_cplex(contract_start, contract_end))

What I want is, for each client id and each contract, to identify how many contracts associated to that same client id intersect with that contract. In other words: I want to create a new column called simultaneous which will depict for each row —i.e. for each contract— how many active contracts the corresponding client has during the very same period that the current contract is active. In case no intersection with any other contract is found for a given contract, then the value of simultaneous would have to be 1 —as while that contract is active it is also the only active contract that the respective client has—.

I figured it would help to obtain the combinations of inter_complex, then turn those combinations of complex numbers into combinations of intervals, and then use lubridate's intersect function to discern whether or not each combination of intervals intersect. For that purpose, I have written the following code:

## This function turns complex numbers into intervals.
cplex_2_inter = function(x) {
    start = x %>% Re() %>% as.integer()
    end = x %>% Im() %>% as.integer()

    interval(as_date(start), as_date(end))
}


## This other function returns a list whose j-th element is a data frame that shows the interceptions
## between combinations of j + 1 intervals.
get_intersections = function(x) {
    max_m = length(x)
    output = vector(mode = "list", length = max_m - 1)

    for (i in 2:max_m) {
        output[[i - 1]] = combn(x, m = i) %>% t() %>% as.data.frame() %>% 
                          mutate_all(cplex_2_inter) %>% rowid_to_column("id") %>%
                          pivot_longer(-id) %>% group_by(id) %>% 
                          mutate(simultaneous = do.call(lubridate::intersect, as.list(value))) %>%
                          mutate(simultaneous = if_else(as.character(simultaneous) != "NA--NA", i, 1L))
    }

    return(output)
}

In order to get a better grasp on what the function get_intersections does, I propose that you run the following:

example = my_data %>% filter(client_id == 1) %>% pull(inter_complex) %>% get_intersections()

The data frame example[[1]] shows whether there are interceptions —or, for a better word, overlaps— between pairs of intervals. The data frame example[[2]] shows whether there are overlaps between groups of three intervals, and so on.

You may notice that according to example[[1]] the interval 2019-07-15 UTC--2020-07-15 UTC overlaps with some other interval —and hence, the associated value of simultaneous is 2— while, according to example[[2]], that very same interval is associated to a value of 3 for the variable simultaneous. Naturally, the idea is to assign to each interval its highest simultaneous value.

Since I do not care about global overlaps but rather about overlaps within each client id I figured I would require to work on a grouped data frame. The furthest I got on this project was writing this:

my_data %>% group_by(client_id) %>% group_map(~ get_intersections(.x$inter_complex))

Now onto my questions. 1) I have executed the line above, but the process is not very efficient. It has already run for a bit more than a whole day and it doesn't finish yet. Recently I came across the concept of interval trees but I'm not a computer scientist and I would require help in order to tackle this problem in a smarter way. 2) In case we stick to my not-so-smart approach to the problem, I would still require a function that accesses each element of the list that is returned by get_intersections, so as to identify and retrieve the highest simultaneous value associated to each interval. On that matter I would have to request help as well.

Edit

Regarding Wimpel's answer, I have examined their data table and I found this.

> DT %>% filter(client_id == 502 & contract_id == 3093) %>%
> select(contract_start, contract_end, contract_intersect)
# Output
   contract_start contract_end contract_intersect
1:     2018-01-11   2019-01-11                  7

That is, the displayed contract allegedly overlaps with seven other contracts that the same client has.

On the other hand, let's see if this holds true when using my combinations-based approach.

combs_10_502 = my_data %>% filter(client_id == 502) %>% pull(inter_complex) %>% 
               combn(10) %>% t() %>% as.data.frame() %>% mutate_all(cplex_2_inter) %>% 
               rowid_to_column("id") %>% pivot_longer(-id) %>% group_by(id) %>% 
               mutate(simultaneous = do.call(lubridate::intersect, as.list(value))) %>% 
               ungroup() %>% 
               mutate(simultaneous = if_else(as.character(simultaneous) != "NA--NA", 10L, 1L))    

> combs_10_502 %>% filter(simultaneous == 10) %>% slice(11:20)
# A tibble: 10 x 4
  id    name  value                          simultaneous
  <int> <chr> <Interval>                            <int>
1  24311 V1    2018-01-11 UTC--2019-01-11 UTC        10
2  24311 V2    2018-03-01 UTC--2019-03-01 UTC        10
3  24311 V3    2018-07-11 UTC--2019-07-11 UTC        10
4  24311 V4    2018-04-20 UTC--2019-04-20 UTC        10
5  24311 V5    2018-05-21 UTC--2019-05-21 UTC        10
6  24311 V6    2018-08-10 UTC--2019-08-10 UTC        10
7  24311 V7    2018-08-09 UTC--2019-08-09 UTC        10
8  24311 V8    2018-09-27 UTC--2019-09-27 UTC        10
9  24311 V9    2020-01-03 UTC--2021-01-03 UTC        10
10 24311 V10   2019-12-19 UTC--2020-12-19 UTC        10

The same contract is displayed on the first row of the tibble above. As can be seen, that contract actually overlaps with nine other contracts of the given client —those nine are displayed on the remaining rows—.

I don't know how Wimpel's solution got this wrong, but I checked that it does get the number of intersections right for several other contracts. Now I know that a data table-based solution is what I am looking for, since the processes are made very fast, but there seems to be an issue with the proposed solution.

Upvotes: 2

Views: 488

Answers (1)

Wimpel
Wimpel

Reputation: 27772

I believe you are looking for something like this?

library(data.table)
DT <- fread("https://raw.githubusercontent.com/pazos-feren/Data/main/contracts.csv")
#set dates as real dates
DT[, contract_start := as.Date(contract_start, format = "%d/%m/%Y")]
DT[, contract_end := as.Date(contract_end, format = "%d/%m/%Y")]

setkey(DT, V1)

DT[DT, c("contract_intersect", "contract_intersect_ids") := {
  val = DT[ !V1 == i.V1 & client_id == i.client_id &
              contract_start <= i.contract_end & contract_end >= i.contract_start, ]
  list( nrow(val), paste0(val$contract_id, collapse = ";") )
}, by = .EACHI]

#    V1 client_id contract_id contract_start contract_end inter_complex contract_intersect contract_intersect_ids
# 1:  1         1           1     2019-07-15   2020-07-15  18092+18458i                  2              4162;4168
# 2:  2         3           3     2015-01-01   2015-01-01  16436+16436i                  0                       
# 3:  3         5           5     2020-06-12   2020-06-12  18425+18425i                  0                       
# 4:  4        13          13     2015-01-01   2015-01-01  16436+16436i                  0                       
# 5:  5        18          18     2015-01-01   2015-01-01  16436+16436i                  0                       
# 6:  6        19          19     2015-01-01   2015-01-01  16436+16436i                  0                       

Upvotes: 0

Related Questions