Kasia Kulma
Kasia Kulma

Reputation: 1722

Calculating active days from overlapping dates using data.table

I'm trying to solve the problem I posted here using data.table package or other solutions dealing efficiently with big data (14-22 million rows). Any hints on how to speed this solution up or find a quicker workaround?

Thanks a lot for your help!

Upvotes: 2

Views: 195

Answers (3)

minem
minem

Reputation: 3650

1) Lets multiply data:

d <- replicate(1e2, data, simplify = F)
d <- rbindlist(d, use.names = T, fill = T, idcol = T)
d[, user_id := paste0(user_id, .id)]
      .id user_id start_date   end_date
   1:   1     121 2010-10-31 2011-10-31
   2:   1     121 2010-12-18 2011-12-18
   3:   1     121 2011-10-31 2014-04-28
   4:   1     121 2011-12-18 2014-12-18
   5:   1     121 2014-03-27 2015-03-27
  ---                                  
1296: 100   33100 1992-07-01 2016-07-01
1297: 100   33100 1993-08-20 2016-08-16
1298: 100   33100 1999-10-28 2012-11-15
1299: 100   33100 2006-01-31 2006-02-28
1300: 100   33100 2016-08-26 2017-01-26

2) write function from previous post:

yourFunction <- function(data){
  data %>%
    rowwise() %>%
    do(data_frame(user_id = .$user_id, 
                  Date = seq(.$start_date, .$end_date, by = 1))) %>%
    distinct() %>%
    ungroup() %>%
    count(user_id)
}

rez1 <- yourFunction(d)
rez1
    # A tibble: 200 x 2
   user_id     n
     <chr> <int>
 1     121  2606
 2    1210  2606
 3   12100  2606
 4    1211  2606
 5    1212  2606
 6    1213  2606
 7    1214  2606
 8    1215  2606
 9    1216  2606
10    1217  2606
# ... with 190 more rows

3) my data.table approach:

myFunction <- function(data){
  setDT(data)
  seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
  data[, n:= seq2(start_date, end_date)]
  d <- data[, .(day = unlist(n)), by = user_id]
  d[, .(n = uniqueN(day)), by = user_id]
}
rez2 <- myFunction(d)

3) Test if results are equal:

setDT(rez1)
setorder(rez1, user_id)
setorder(rez2, user_id)
all.equal(rez1, rez2)
[1] TRUE

4) BENCHMARKS:

cols <- c("test", "replications", "elapsed", "relative")
rbenchmark::benchmark(yourFunction(d),
                      myFunction(d), replications = 1, columns = cols)
             test replications elapsed relative
1 yourFunction(d)            1   10.23   42.625
2   myFunction(d)            1    0.24    1.000

5) Lets try with bigger data:

d <- replicate(1e5, data, simplify = F)
d <- rbindlist(d, use.names = T, fill = T, idcol = T)
d[, .N]
[1] 1300000
d[, user_id := paste0(user_id, .id)]

system.time(rez3 <- myFunction(d))

Have not yet finished....

UPDATE:

6) We can get a great increase in speed if we firstly convert the dates to integer. My approach nr.2:

  myFunction2 <- function(data){
    setDT(data)
    seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
    startD <- as.integer(data[["start_date"]])
    endD <- as.integer(data[["end_date"]])
    seqences <- seq2(startD, endD)
    data[, n:= seqences]
    d <- data[, .(day = unlist(n)), by = user_id]
    d[, .(n = uniqueN(day)), by = user_id]
  }

7) Now we can compere to my fist function using bigger data than previously:

d <- replicate(1e4, data, simplify = F)
d <- rbindlist(d, use.names = T, fill = T, idcol = T)
d[, user_id := paste0(user_id, .id)]
d[, .N]
[1] 130000
### BENCHMARK
                    test replications elapsed relative
2  rez1 <- myFunction(d)            1   91.19    7.657
1 rez2 <- myFunction2(d)            1   11.91    1.000
all.equal(rez1, rez2)
[1] TRUE

UPDATE2:

9) It was a mistake to do unlist and uniqueN separately, if we combine that in one single data.table call, we reduce memory usage and increase speed by approximately 3-4 times:

myFunction3 <- function(data){
    setDT(data)
    seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
    startD <- as.integer(data[["start_date"]])
    endD <- as.integer(data[["end_date"]])
    seqences <- seq2(startD, endD)
    data[, n:= seqences]
    data[, .(n = uniqueN(unlist(n))), by = user_id]
  }

rbenchmark::benchmark(rez2 <- myFunction2(d),
                      rez1 <- myFunction3(d), replications = 1, columns = cols)
                    test replications elapsed relative
2 rez1 <- myFunction3(d)            1    4.19    1.000
1 rez2 <- myFunction2(d)            1   14.06    3.356

10) With this last approach I can process 1.3 million rows in ~25 seconds.

With this last approach I can process 0.78 million rows in ~1 minute(depending on memory).

11) original vs last: (on 1300 rows)

             test replications elapsed relative
1 yourFunction(d)            1   10.22  340.667
2  myFunction3(d)            1    0.03    1.000

UPDATE3:

12) Maybe this function can increase speed a bit:

myFunction5 <- function(d){
  setDT(d)
  setkey(d, user_id)
  seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
  startD <- as.integer(d[["start_date"]])
  endD <- as.integer(d[["end_date"]])
  seqences <- seq2(startD, endD)
  dd <- d[, .(list(.I)), by = user_id]
  indlist <- dd[[2]]
  mf <- function(x) uniqueN(unlist(x))
  ff <- function(x) mf(seqences[x])
  ff2 <- Vectorize(ff, "x")
  r <- ff2(indlist)
  data.table(user_id = dd[[1]], n = r, key = "user_id")
}
             test replications elapsed relative
1  myFunction3(d)            1    3.71     1.22
2 myFunction4(d1)            1    3.04     1.00

Upvotes: 5

lmo
lmo

Reputation: 38500

If I understand your question, which is count the number of unique days for each ID, an alternative using Map to construct the sequential dates is

setDT(data)[, .(cnt=uniqueN(unlist(Map(seq, start_date, end_date, by="day")))), by=user_id]
   user_id  cnt
1:      12 2606
2:      33 8967

Upvotes: 2

Clayton Stanley
Clayton Stanley

Reputation: 7784

This method keeps seq outside the inner loop, but has the unfortunate consequence of being memory hungry, and so breaks down at about 1e5. But depending on your number of user_ids and date range entries, this might be faster:

> d[, .SD
   ][, .(date=seq(from=min(start_date), to=max(end_date), by=1))
   ][d, .(user_id=i.user_id, start_date=i.start_date, end_date=i.end_date, date=x.date), on=.(date >= start_date, date <= end_date), allow.cartesian=T
   ][, unique(.SD, by=c('user_id', 'date'))
   ][, .N, user_id
   ][order(user_id)
   ]

Upvotes: 1

Related Questions