Reputation: 93
Given the following data.table with financial data (35 million rows):
DT:
userId  Date        balance  overdraft (boolean)
600     2014-11-01    -100   1
600     2014-11-02    1000   0
600     2014-11-03    -100   1
600     2014-11-04    -100   1
600     2014-11-05     100   0
600     2014-11-06     100   0
700     2014-11-01    -100   1
700     2014-11-02    1000   0
700     2014-11-03    -100   1
700     2014-11-04    -100   1
700     2014-11-05    -100   1
700     2014-11-06     100   0
The case:
a.- Maximum number of consecutive overdraft days by userId.
userId maxConsecutiveOverdraftDays
600 2
700 3
800 0
900 1
1000 5
In this case I did the following:
acum = FALSE
for (i in 1:nrow(DT)) {
  if (DT[i]$overdraft == 1) {
    if (acum == TRUE) {
      DT[i]$acumBalance <- DT[i]$balance + DT[i-1]$balance
      DT[i]$totalConsecutiveOverdraftDays <- DT[i]$overdraft + DT[i-1]$overdraft
    }
    if (DT[i]$userId == DT[i+1]$userId && DT[i+1]$overdraft == 1) {
      acum = TRUE
    } else {
      acum = FALSE
    }
  }
}
DT[,maxConsecutiveOverdraftDays:=max(totalConsecutiveOverdraftDays),by=userId]
It takes more than 12 hours to finish.
How can I improve the code and reduce the computation time?
Thanks in advance.
Upvotes: 2
Views: 172
Reputation: 118779
From the data.table NEWS, item 7:

rleid(), a convenience function for generating a run-length type id column to be used in grouping operations, is now implemented. Closes #686. Check the ?rleid examples section for usage scenarios.
With this function, we can do:
require(data.table) ## 1.9.5+
dt[, .(overdraft = overdraft[1L], .N), by=.(userId,
rleid(overdraft))][overdraft == 1L, max(N), by=userId]
There's a FR #686 open to implement such a function, but it will have to be implemented along with a fast rank function. We have not gotten to it yet.
Until then, you can do it this way:
dt[, .(overdraft = overdraft[1L], .N), by=.(userId,
cumsum(c(TRUE, diff(overdraft) != 0L)))][overdraft == 1L, max(N), by=userId]
# userId V1
# 1: 600 2
# 2: 700 3
Edit: Corrected logic as pointed out by @Dirk. Thanks!
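As an aside (a sketch, not part of the original answer): the question's expected output also lists users who are never overdrawn (e.g. userId 800 with 0). One way to cover that case with the same rleid() grouping is to take the max over the lengths of the overdraft==1 runs and fall back to 0L; the small DT below is made up for illustration:

```r
library(data.table)  ## 1.9.5+ for rleid()

DT <- data.table(
  userId    = rep(c(600L, 700L, 800L), each = 3L),
  overdraft = c(1L, 1L, 0L,   # user 600: run of 2
                0L, 1L, 1L,   # user 700: run of 2
                0L, 0L, 0L)   # user 800: never overdrawn
)

# Size of each run of identical overdraft values per user, then the
# longest run where overdraft == 1 (0L if a user has no such run).
res <- DT[, .N, by = .(userId, rleid(overdraft), overdraft)][
  , .(maxConsecutiveOverdraftDays = max(N[overdraft == 1L], 0L)),
  by = userId]
```

max(N[overdraft == 1L], 0L) works because subsetting with a condition that matches nothing yields integer(0), and max(integer(0), 0L) is 0L.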
Upvotes: 2
Reputation: 1272
This one-liner should do the trick:
overdraft[, daysoverdraft:=seq(.N)-1, by=cumsum(overdraft == 0)]
Let's test with a simple set of data:
overdraft = data.table(userId = rep(1:10^3, each = 350),
                       date = rep(1:350, 10^3),
                       balance = round(rnorm(350*10^3)*100),
                       overdraft = 0)
overdraft[balance < 0, overdraft := 1]
With 350k rows it runs acceptably fast (~1.2 seconds on my laptop); however, it is not even close to the speed of Arun's answer, which is at least 20x faster.
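For completeness (a sketch, not part of the original answer): the one-liner gives a per-row running count but not the per-user maximum the question asks for, and grouping only by cumsum(overdraft == 0) lets a run spill across user boundaries. Adding userId to the grouping and taking a max per user, on the question's sample data:

```r
library(data.table)

DT <- data.table(
  userId    = rep(c(600L, 700L), each = 6L),
  overdraft = c(1L, 0L, 1L, 1L, 0L, 0L,   # user 600: longest run = 2
                1L, 0L, 1L, 1L, 1L, 0L))  # user 700: longest run = 3

# cumsum(overdraft == 0) increments on every non-overdraft day, so each
# group is (at most one 0 followed by) a run of 1s; cumsum(overdraft)
# then counts consecutive overdraft days within that run.
DT[, daysoverdraft := cumsum(overdraft),
   by = .(userId, cumsum(overdraft == 0))]

res <- DT[, .(maxConsecutiveOverdraftDays = max(daysoverdraft)),
          by = userId]
```

Using cumsum(overdraft) instead of seq(.N)-1 also counts correctly when a user's very first day is already overdrawn (that first group has no leading 0).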
Upvotes: 1
Reputation: 9687
Can't say if this will help your performance problem, but rle is helpful here for nice, short code. Since the value of overdraft is always zero or one, we can take the max of the product of lengths and values:
> aggregate(overdraft~userId, df, FUN=function(x) {
+ r <- rle(x)
+ max(r$lengths * r$values)
+ })
userId overdraft
1 600 2
2 700 3
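Since the question's 35-million-row table is a data.table, the same rle idea can also be run inside grouped j, which avoids aggregate()'s data-frame overhead (a sketch on the question's sample data; the c(0L, ...) guard returns 0 for users who are never overdrawn):

```r
library(data.table)

DT <- data.table(
  userId    = rep(c(600L, 700L), each = 6L),
  overdraft = c(1L, 0L, 1L, 1L, 0L, 0L,
                1L, 0L, 1L, 1L, 1L, 0L))

# rle() per user: lengths * values zeroes out the runs of 0s, so the
# max of the products is the longest run of 1s.
res <- DT[, {
  r <- rle(overdraft)
  .(maxConsecutiveOverdraftDays = max(c(0L, r$lengths * r$values)))
}, by = userId]
```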
Upvotes: 2