Kasia Kulma

Reputation: 1722

Calculating active days/months from overlapping dates

I have data listing start and end dates for different products for a large number of customers. The intervals for different products can overlap or have gaps between purchases:

library(lubridate)
library(Hmisc)
library(dplyr)

user_id <- c(rep(12, 8), rep(33, 5))

start_date <- dmy(Cs(31/10/2010, 18/12/2010, 31/10/2011, 18/12/2011, 27/03/2014, 18/12/2014, 27/03/2015, 18/12/2016,
                     01/07/1992, 20/08/1993, 28/10/1999, 31/01/2006, 26/08/2016))

end_date <- dmy(Cs(31/10/2011, 18/12/2011, 28/04/2014, 18/12/2014, 27/03/2015, 18/12/2016, 27/03/2016, 18/12/2017,
                   01/07/2016, 16/08/2016, 15/11/2012, 28/02/2006, 26/01/2017))

data <- data.frame(user_id, start_date, end_date)

data
   user_id start_date   end_date
1       12 2010-10-31 2011-10-31
2       12 2010-12-18 2011-12-18
3       12 2011-10-31 2014-04-28
4       12 2011-12-18 2014-12-18
5       12 2014-03-27 2015-03-27
6       12 2014-12-18 2016-12-18
7       12 2015-03-27 2016-03-27
8       12 2016-12-18 2017-12-18
9       33 1992-07-01 2016-07-01
10      33 1993-08-20 2016-08-16
11      33 1999-10-28 2012-11-15
12      33 2006-01-31 2006-02-28
13      33 2016-08-26 2017-01-26

I'd like to calculate, for each customer, the total number of active days or months during which they held any of the products.

It wouldn't be a problem if the products ALWAYS overlapped, as then I could simply take:

data %>%
  group_by(user_id) %>%
  dplyr::summarize(time_diff = max(end_date) - min(start_date))

However, as you can see for user 33, products don't always overlap, and such disjoint intervals have to be added separately to the merged 'overlapping' intervals.
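
For example, user 33's purchases collapse into two separate blocks of activity with a short gap in between, so (counting both the start and the end date as active) the total I'm after is the sum of the two block lengths rather than the overall span:

# illustration only: user 33's intervals merge into two disjoint blocks,
#   1992-07-01 .. 2016-08-16  and  2016-08-26 .. 2017-01-26
block1 <- as.numeric(dmy("16/08/2016") - dmy("01/07/1992")) + 1  # 8813 days
block2 <- as.numeric(dmy("26/01/2017") - dmy("26/08/2016")) + 1  #  154 days
block1 + block2
# [1] 8967
# whereas max(end_date) - min(start_date) would also count the gap days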

Is there a quick and elegant way to code it, hopefully in dplyr?

Upvotes: 2

Views: 674

Answers (3)

Roman

Reputation: 17648

What about using IRanges and reduce?

library(IRanges)
data %>% 
  group_by(user_id) %>% 
  summarise(days_held=sum(width(reduce(IRanges(as.numeric(start_date), as.numeric(end_date)))))) 
# A tibble: 2 × 2
  user_id days_held
    <dbl>     <int>
1      12      2606
2      33      8967
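
reduce merges all overlapping ranges within a user, and width counts the days in each merged block (both endpoints included). For user 33, for example:

# illustration of what reduce() does on user 33's rows
u33 <- data[data$user_id == 33, ]
ir  <- IRanges(as.numeric(u33$start_date), as.numeric(u33$end_date))
reduce(ir)               # two merged blocks: 1992-07-01..2016-08-16 and 2016-08-26..2017-01-26
sum(width(reduce(ir)))   # 8967, the value summarised above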

And here are the benchmarks using Nathan Werth's big_data. The IRanges method seems to be a little bit faster.

my_result <- function(x) {
  x %>%
    group_by(user_id) %>%
    summarise(days_held = sum(width(reduce(IRanges(as.numeric(start_date), as.numeric(end_date))))))
}


library(microbenchmark)
microbenchmark(
  a <- my_result(big_data),
  b <- my_answer(big_data), times=2
)
Unit: seconds
                     expr      min       lq     mean   median       uq      max neval cld
 a <- my_result(big_data) 14.97008 14.97008 14.98896 14.98896 15.00783 15.00783     2  a 
 b <- my_answer(big_data) 17.59373 17.59373 17.76257 17.76257 17.93140 17.93140     2   b

all.equal(a, b)
[1] TRUE

Edit

To visualize the ranges, you can also plot the data:

library(Gviz)
library(GenomicRanges)
a <- sapply(split(data, data$user_id), function(x) {
  AnnotationTrack(start = as.numeric(x$start_date), end = as.numeric(x$end_date),
                  chromosome = "chrNA", stacking = "full", name = as.character(unique(x$user_id)))
})
plotTracks(trackList = a)

[plot: one annotation track of date ranges per user_id]

Upvotes: 3

Nathan Werth

Reputation: 5263

Making a data.frame is not very efficient, so you can save time by keeping the ranges as Date vectors.

multi_seq_date <- Vectorize(seq.Date, c('from', 'to'), SIMPLIFY = FALSE)

data %>%
  group_by(user_id) %>%
  mutate(date_seq = multi_seq_date(start_date, end_date, by = 'day')) %>%
  summarise(days_held = length(unique(unlist(date_seq))))
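
For example, on the first two (overlapping) rows of the question's data, just as an illustration:

seqs <- multi_seq_date(data$start_date[1:2], data$end_date[1:2], by = 'day')
lengths(seqs)                  # 366 366: one sequence of dates per input row
length(unique(unlist(seqs)))   # 414: overlapping days are counted only once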

I'm sure there's a more idiomatic tidyverse way to write that, but I'm not a tidyverse guy.

multi_seq_date will return a list of date sequences. Then it's just a matter of counting the unique days across that list. I ran this and ycw's answer on a large randomly-generated sample set:

# Making the data -----------------------------------
big_size <- 100000
starting_range <- seq(dmy('01-01-1990'), dmy('01-01-2017'), by = 'day')

set.seed(123456)
big_data <- data.frame(
  user_id    = sample(seq_len(round(big_size / 4)), big_size, replace = TRUE),
  start_date = sample(starting_range, big_size, replace = TRUE)
)
big_data$end_date <- big_data$start_date + round(runif(big_size, 1, 500))


# The actual process to test -------------------------
my_answer <- function(x) {
  multi_seq_date <- Vectorize(seq.Date, c('from', 'to'), SIMPLIFY = FALSE)
  x %>%
    group_by(user_id) %>%
    mutate(date_seq = multi_seq_date(start_date, end_date, by = 'day')) %>%
    summarise(days_held = length(unique(unlist(date_seq))))
}

On my computer, my_answer took about 13 seconds.

Upvotes: 2

www

Reputation: 39154

We can use functions from dplyr to count the total number of days. The following example expands each time period into individual dates, removes duplicated dates, and finally counts the number of rows (days) for each user_id.

data2 <- data %>%
  rowwise() %>%
  do(data_frame(user_id = .$user_id, 
     Date = seq(.$start_date, .$end_date, by = 1))) %>%
  distinct() %>%
  ungroup() %>%
  count(user_id)
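
On the example data, the resulting count column n should match the totals from the other answers:

data2
#   user_id     n
# 1      12  2606
# 2      33  8967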

Upvotes: 2
