Reputation: 1722
I'm trying to solve the problem I posted here using the data.table
package or another approach that handles big data (14-22 million rows) efficiently. Any hints on how to speed this solution up, or on a quicker workaround?
Thanks a lot for your help!
Upvotes: 2
Views: 195
Reputation: 3650
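The steps below operate on a small data.table named data with columns user_id, start_date and end_date; the real 13-row sample is in the linked question. A minimal, hypothetical 4-row stand-in (values copied from the printed output further down) so the code is self-contained:
library(data.table)
library(dplyr)  # for yourFunction() in step 2

# Hypothetical stand-in for the sample data in the linked question;
# the real sample has 13 rows (hence 1300 rows after 100 copies below).
data <- data.table(
  user_id    = c(12L, 12L, 33L, 33L),
  start_date = as.Date(c("2010-10-31", "2010-12-18", "1992-07-01", "2006-01-31")),
  end_date   = as.Date(c("2011-10-31", "2011-12-18", "2016-07-01", "2006-02-28"))
)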
1) Let's multiply the data:
# Stack 100 copies of the sample data and make the user_ids unique per copy
d <- replicate(1e2, data, simplify = FALSE)
d <- rbindlist(d, use.names = TRUE, fill = TRUE, idcol = TRUE)
d[, user_id := paste0(user_id, .id)]
.id user_id start_date end_date
1: 1 121 2010-10-31 2011-10-31
2: 1 121 2010-12-18 2011-12-18
3: 1 121 2011-10-31 2014-04-28
4: 1 121 2011-12-18 2014-12-18
5: 1 121 2014-03-27 2015-03-27
---
1296: 100 33100 1992-07-01 2016-07-01
1297: 100 33100 1993-08-20 2016-08-16
1298: 100 33100 1999-10-28 2012-11-15
1299: 100 33100 2006-01-31 2006-02-28
1300: 100 33100 2016-08-26 2017-01-26
2) Write the function from the previous post:
yourFunction <- function(data){
  data %>%
    rowwise() %>%
    do(data_frame(user_id = .$user_id,
                  Date = seq(.$start_date, .$end_date, by = 1))) %>%
    distinct() %>%
    ungroup() %>%
    count(user_id)
}
rez1 <- yourFunction(d)
rez1
# A tibble: 200 x 2
user_id n
<chr> <int>
1 121 2606
2 1210 2606
3 12100 2606
4 1211 2606
5 1212 2606
6 1213 2606
7 1214 2606
8 1215 2606
9 1216 2606
10 1217 2606
# ... with 190 more rows
3) My data.table approach:
myFunction <- function(data){
  setDT(data)
  # vectorised seq: one vector of days per (start_date, end_date) pair
  seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
  data[, n := seq2(start_date, end_date)]
  d <- data[, .(day = unlist(n)), by = user_id]
  d[, .(n = uniqueN(day)), by = user_id]
}
rez2 <- myFunction(d)
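A quick side note (my illustration, not part of the original post): seq2 maps over the from/to pairs and, when the resulting lengths differ, returns a list with one vector of days per row. Shown here on integers, as used in the later variants:
seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
seq2(as.integer(as.Date(c("2020-01-01", "2020-01-01"))),
     as.integer(as.Date(c("2020-01-03", "2020-01-02"))))
# [[1]]
# [1] 18262 18263 18264   (days since the 1970-01-01 origin)
# [[2]]
# [1] 18262 18263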
4) Test if the results are equal:
setDT(rez1)
setorder(rez1, user_id)
setorder(rez2, user_id)
all.equal(rez1, rez2)
[1] TRUE
5) BENCHMARKS:
cols <- c("test", "replications", "elapsed", "relative")
rbenchmark::benchmark(yourFunction(d),
                      myFunction(d), replications = 1, columns = cols)
test replications elapsed relative
1 yourFunction(d) 1 10.23 42.625
2 myFunction(d) 1 0.24 1.000
6) Let's try with bigger data:
d <- replicate(1e5, data, simplify = F)
d <- rbindlist(d, use.names = T, fill = T, idcol = T)
d[, .N]
[1] 1300000
d[, user_id := paste0(user_id, .id)]
system.time(rez3 <- myFunction(d))
It still had not finished....
7) We can get a large increase in speed if we first convert the dates to integer. My approach no. 2:
myFunction2 <- function(data){
  setDT(data)
  seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
  # work on integers instead of Dates to avoid seq.Date dispatch
  startD <- as.integer(data[["start_date"]])
  endD   <- as.integer(data[["end_date"]])
  sequences <- seq2(startD, endD)
  data[, n := sequences]
  d <- data[, .(day = unlist(n)), by = user_id]
  d[, .(n = uniqueN(day)), by = user_id]
}
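To see why this helps (my own quick sketch, not in the original post): seq() on Date objects dispatches to the heavier seq.Date method, while integer input takes the plain default path:
d1 <- as.Date("2010-01-01"); d2 <- as.Date("2017-01-01")
rbenchmark::benchmark(
  dates    = seq(d1, d2, by = 1),                  # seq.Date
  integers = seq(as.integer(d1), as.integer(d2)),  # seq.default
  replications = 1e4, columns = cols)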
8) Now we can compare it to my first function, using bigger data than before:
d <- replicate(1e4, data, simplify = F)
d <- rbindlist(d, use.names = T, fill = T, idcol = T)
d[, user_id := paste0(user_id, .id)]
d[, .N]
[1] 130000
### BENCHMARK
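Presumably the benchmark call looked like this:
rbenchmark::benchmark(rez1 <- myFunction(d),
                      rez2 <- myFunction2(d), replications = 1, columns = cols)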
test replications elapsed relative
2 rez1 <- myFunction(d) 1 91.19 7.657
1 rez2 <- myFunction2(d) 1 11.91 1.000
all.equal(rez1, rez2)
[1] TRUE
9) It was a mistake to do unlist and uniqueN in separate steps; if we combine them in a single data.table call, we reduce memory usage and increase speed by roughly 3-4 times:
myFunction3 <- function(data){
  setDT(data)
  seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
  startD <- as.integer(data[["start_date"]])
  endD   <- as.integer(data[["end_date"]])
  sequences <- seq2(startD, endD)
  data[, n := sequences]
  # unlist and uniqueN in one grouped call: no intermediate day-level table
  data[, .(n = uniqueN(unlist(n))), by = user_id]
}
rbenchmark::benchmark(rez2 <- myFunction2(d),
                      rez1 <- myFunction3(d), replications = 1, columns = cols)
test replications elapsed relative
2 rez1 <- myFunction3(d) 1 4.19 1.000
1 rez2 <- myFunction2(d) 1 14.06 3.356
10) With this last approach I could process 1.3 million rows in ~25 seconds in one run, and 0.78 million rows in ~1 minute in another (depending on available memory).
11) Original vs. last (on 1300 rows):
test replications elapsed relative
1 yourFunction(d) 1 10.22 340.667
2 myFunction3(d) 1 0.03 1.000
12) Maybe this function can increase the speed a bit more:
myFunction5 <- function(d){
  setDT(d)
  setkey(d, user_id)
  seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
  startD <- as.integer(d[["start_date"]])
  endD   <- as.integer(d[["end_date"]])
  sequences <- seq2(startD, endD)
  # row indices belonging to each user_id
  dd <- d[, .(list(.I)), by = user_id]
  indlist <- dd[[2]]
  # count unique days over the concatenated sequences of a user's rows
  mf  <- function(x) uniqueN(unlist(x))
  ff  <- function(x) mf(sequences[x])
  ff2 <- Vectorize(ff, "x")
  r <- ff2(indlist)
  data.table(user_id = dd[[1]], n = r, key = "user_id")
}
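The table below presumably came from a call such as:
rbenchmark::benchmark(myFunction3(d),
                      myFunction5(d), replications = 1, columns = cols)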
test replications elapsed relative
1 myFunction3(d) 1 3.71 1.22
2 myFunction5(d) 1 3.04 1.00
Upvotes: 5
Reputation: 38500
If I understand your question, which is to count the number of unique days for each ID, an alternative using Map to construct the date sequences is:
setDT(data)[, .(cnt=uniqueN(unlist(Map(seq, start_date, end_date, by="day")))), by=user_id]
user_id cnt
1: 12 2606
2: 33 8967
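The same logic can be spelled out in two steps, which may be easier to read (a sketch, same data assumed):
setDT(data)
# expand each row into its individual days
days <- data[, .(day = unlist(Map(seq, start_date, end_date, by = "day"))), by = user_id]
# count distinct days per user
days[, .(cnt = uniqueN(day)), by = user_id]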
Upvotes: 2
Reputation: 7784
This method keeps seq outside the inner loop, but it is memory hungry: the non-equi join materializes one row per interval-day pair before deduplicating, so it breaks down at about 1e5 rows. Depending on your number of user_ids and date-range entries, though, it might be faster:
d[, .SD
  ][, .(date = seq(from = min(start_date), to = max(end_date), by = 1))  # every day in the overall range
  ][d, .(user_id = i.user_id, start_date = i.start_date, end_date = i.end_date, date = x.date),
      on = .(date >= start_date, date <= end_date), allow.cartesian = TRUE  # one row per interval-day pair
  ][, unique(.SD, by = c('user_id', 'date'))  # drop days covered by several intervals of a user
  ][, .N, user_id                             # count distinct days per user
  ][order(user_id)
  ]
Upvotes: 1