Reputation: 1722
I have data listing start and end dates for different products for a big number of customers. The intervals for different products can overlap or have time gaps between purchases:
library(lubridate)
library(Hmisc)
library(dplyr)
user_id <- c(rep(12, 8), rep(33, 5))
start_date <- dmy(Cs(31/10/2010, 18/12/2010, 31/10/2011, 18/12/2011, 27/03/2014, 18/12/2014, 27/03/2015, 18/12/2016, 01/07/1992, 20/08/1993, 28/10/1999, 31/01/2006, 26/08/2016))
end_date <- dmy(Cs(31/10/2011, 18/12/2011, 28/04/2014, 18/12/2014, 27/03/2015, 18/12/2016, 27/03/2016, 18/12/2017,
01/07/2016, 16/08/2016, 15/11/2012, 28/02/2006, 26/01/2017))
data <- data.frame(user_id, start_date, end_date)
data
user_id start_date end_date
1 12 2010-10-31 2011-10-31
2 12 2010-12-18 2011-12-18
3 12 2011-10-31 2014-04-28
4 12 2011-12-18 2014-12-18
5 12 2014-03-27 2015-03-27
6 12 2014-12-18 2016-12-18
7 12 2015-03-27 2016-03-27
8 12 2016-12-18 2017-12-18
9 33 1992-07-01 2016-07-01
10 33 1993-08-20 2016-08-16
11 33 1999-10-28 2012-11-15
12 33 2006-01-31 2006-02-28
13 33 2016-08-26 2017-01-26
I'd like to calculate the total number of active days or months during which he/she held any the products.
It wouldn't be a problem if the products ALWAYS overlapped as then I could simply take
data %>%
group_by(user_id) %>%
dplyr::summarize(time_diff = max(end_date) - min(start_date))
However, as you can see in user 33, products don't always overlap and their interval has to be added separately to all 'overlapped' intervals.
Is there a quick and elegant way to code it, hopefully in dplyr
?
Upvotes: 2
Views: 674
Reputation: 17648
Whats about using IRanges
and intersect
?
library(IRanges)
data %>%
group_by(user_id) %>%
summarise(days_held=sum(width(reduce(IRanges(as.numeric(start_date), as.numeric(end_date))))))
# A tibble: 2 × 2
user_id active_days
<dbl> <int>
1 12 2606
2 33 8967
And here the benchmarks using Nathan Wert's big_data
. The IRange method seems to be a little bit faster.
my_result <- function(x) {
x %>%
group_by(user_id) %>%
summarise(days_held=sum(width(reduce(IRanges(as.numeric(start_date), as.numeric(end_date))))))
}
library(microbenchmark)
microbenchmark(
a <- my_result(big_data),
b <- my_answer(big_data), times=2
)
Unit: seconds
expr min lq mean median uq max neval cld
a <- my_result(big_data) 14.97008 14.97008 14.98896 14.98896 15.00783 15.00783 2 a
b <- my_answer(big_data) 17.59373 17.59373 17.76257 17.76257 17.93140 17.93140 2 b
all.equal(a, b)
[1] TRUE
To visualize the ranges you also can plot the data...
library(Gviz)
library(GenomicRanges)
a <- sapply(split(data, data$user_id), function(x) {
AnnotationTrack(start = as.numeric(x$start_date), end = as.numeric(x$end_date),
chromosome = "chrNA", stacking = "full", name = as.character(unique(x$user_id)))
})
plotTracks(trackList = a)
Upvotes: 3
Reputation: 5263
Making a data.frame
is not very efficient, so you can save time by keeping the ranges as Date
vectors.
multi_seq_date <- Vectorize(seq.Date, c('from', 'to'), SIMPLIFY = FALSE)
data %>%
group_by(user_id) %>%
mutate(date_seq = multi_seq_date(start_date, end_date, by = 'day')) %>%
summarise(days_held = length(unique(unlist(date_seq))))
I'm sure there's a more idiomatic tidyverse way to write that, but I'm not a tidyverse guy.
multi_seq_date
will return a list of date sequences. Then it's just a matter of counting the unique days across that list. I ran this and ycw's answer on a large randomly-generated sample set:
# Making the data -----------------------------------
big_size <- 100000
starting_range <- seq(dmy('01-01-1990'), dmy('01-01-2017'), by = 'day')
set.seed(123456)
big_data <- data.frame(
user_id = sample(seq_len(round(big_size / 4)), big_size, replace = TRUE),
start_date = sample(starting_range, big_size, replace = TRUE)
)
big_data$end_date <- big_data$start_date + round(runif(big_size, 1, 500))
# The actual process to test -------------------------
my_answer <- function(x) {
multi_seq_date <- Vectorize(seq.Date, c('from', 'to'), SIMPLIFY = FALSE)
x %>%
group_by(user_id) %>%
mutate(date_seq = multi_seq_date(start_date, end_date, by = 'day')) %>%
summarise(days_held = length(unique(unlist(date_seq))))
}
On my computer, my_answer
took about 13 seconds.
Upvotes: 2
Reputation: 39154
We can use functions from dplyr
to count the total number of days. The following example expands each time period, and then removes duplicated dates. Finally count the total row number for each user_id
.
data2 <- data %>%
rowwise() %>%
do(data_frame(user_id = .$user_id,
Date = seq(.$start_date, .$end_date, by = 1))) %>%
distinct() %>%
ungroup() %>%
count(user_id)
Upvotes: 2