Optimising R data.table code for overdraft financial data

Given the following data.table with financial data (35 million rows):

DT:

userId  Date         balance  overdraft (boolean)
600     2014-11-01   -100     1
600     2014-11-02   1000     0
600     2014-11-03   -100     1
600     2014-11-04   -100     1
600     2014-11-05   100      0
600     2014-11-06   100      0
700     2014-11-01   -100     1
700     2014-11-02   1000     0
700     2014-11-03   -100     1
700     2014-11-04   -100     1
700     2014-11-05   -100     1
700     2014-11-06   100      0

The case:

a. Maximum number of consecutive overdraft days per userId.

userId  maxConsecutiveOverdraftDays   
600     2
700     3
800     0
900     1
1000    5

So far I have done the following:

## initialise the running columns (a streak starts from the row itself)
DT[, acumBalance := balance]
DT[, totalConsecutiveOverdraftDays := overdraft]

acum = FALSE

for (i in 1:nrow(DT)) {
    if (DT[i]$overdraft == 1) {
        if (acum == TRUE) {
            ## extend the running totals from the previous row
            DT[i]$acumBalance <- DT[i]$balance + DT[i-1]$acumBalance
            DT[i]$totalConsecutiveOverdraftDays <- DT[i-1]$totalConsecutiveOverdraftDays + 1
        }
        ## keep accumulating only while the next row is the same user and also overdrawn
        if (i < nrow(DT) && DT[i]$userId == DT[i+1]$userId
            && DT[i+1]$overdraft == 1) {
            acum = TRUE
        } else {
            acum = FALSE
        }
    }
}

DT[,maxConsecutiveOverdraftDays:=max(totalConsecutiveOverdraftDays),by=userId]

It takes more than 12 hours to finish.

How can I improve the code and reduce the computation time?

Thanks in advance.

Upvotes: 2

Views: 172

Answers (3)

Arun

Reputation: 118779

Update: #686 is now implemented in 1.9.5. From NEWS:

7) rleid(), a convenience function for generating a run-length type id column to be used in grouping operations is now implemented. Closes #686. Check ?rleid examples section for usage scenarios.

With this function, we can do:

require(data.table) ## 1.9.5+
dt[, .(overdraft = overdraft[1L], .N), by=.(userId, 
      rleid(overdraft))][overdraft == 1L, max(N), by=userId]
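Since `rleid()` does the heavy lifting here, a minimal illustration (my addition, using the overdraft column for userId 600 from the question) of what it produces:

```r
library(data.table)  # 1.9.5+ for rleid()

rleid(c(1, 0, 1, 1, 0, 0))
# [1] 1 2 3 3 4 4
# a new id starts at every change of value, so each run of
# consecutive 1s (or 0s) becomes its own group in `by`
```

Grouping by `userId` and this run id, `.N` is then the length of each run, and the final step keeps only the overdraft runs and takes the per-user maximum.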

There's a FR #686 open to implement such a function. But this'll have to be implemented along with a fast rank function. We have not gotten to it yet.

Until then, you can do it this way:

dt[, .(overdraft = overdraft[1L], .N), by=.(userId, 
    cumsum(c(TRUE, diff(overdraft) != 0L)))][overdraft == 1L, max(N), by=userId]
#    userId V1
# 1:    600  2
# 2:    700  3
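The `cumsum(c(TRUE, diff(overdraft) != 0L))` expression in the `by` builds the same run id by hand: `diff()` is non-zero exactly where the value changes, so the cumulative sum increments at every change (a small check of the equivalence, my addition):

```r
x <- c(1, 0, 1, 1, 0, 0)  # overdraft column for userId 600

cumsum(c(TRUE, diff(x) != 0L))
# [1] 1 2 3 3 4 4  -- identical to rleid(x)
```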

Edit: Corrected logic as pointed out by @Dirk. Thanks!

Upvotes: 2

Dirk

Reputation: 1272

This one-liner should do the trick:

overdraft[, daysoverdraft:=seq(.N)-1, by=cumsum(overdraft == 0)]

Let's test with a simple set of data:

overdraft = data.table(userId=rep(1:10^3, each=350),
                       date=rep(1:350, 10^3),
                       balance=round(rnorm(350*10^3)*100),
                       overdraft=0)
overdraft[balance < 0, overdraft:=1]

With 350k rows it runs acceptably fast (~1.2 seconds on my laptop); however, it is not even close to the speed of Arun's answer, which is at least 20x faster.
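To go from this running per-row count to the question's per-user maximum, one extra grouping step (a sketch of my own, with `userId` added to the `by` so runs don't cross user boundaries) could look like this on the question's sample data; note that a streak at the very start of a user's history is undercounted by one, because `seq(.N)-1` assumes each group opens with a non-overdraft row:

```r
library(data.table)

DT <- data.table(
  userId    = rep(c(600L, 700L), each = 6L),
  overdraft = c(1, 0, 1, 1, 0, 0,   # userId 600
                1, 0, 1, 1, 1, 0))  # userId 700

# running count of consecutive overdraft days within each (user, run) group
DT[, daysoverdraft := seq(.N) - 1L, by = .(userId, cumsum(overdraft == 0))]

DT[, .(maxConsecutiveOverdraftDays = max(daysoverdraft)), by = userId]
#    userId maxConsecutiveOverdraftDays
# 1:    600                           2
# 2:    700                           3
```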

Upvotes: 1

Neal Fultz

Reputation: 9687

I can't say whether this will solve your performance problem, but rle makes for nice short code here. Since the value of overdraft is always zero or one, we can take the max of the product of run lengths and run values:

> aggregate(overdraft~userId, df, FUN=function(x) {
+   r <- rle(x)
+   max(r$lengths * r$values)
+ })
  userId overdraft
1    600         2
2    700         3
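To see why `max(r$lengths * r$values)` works, here are the intermediate pieces for userId 700's overdraft vector (my illustration, not part of the original answer):

```r
x <- c(1, 0, 1, 1, 1, 0)   # overdraft flags for userId 700
r <- rle(x)

r$lengths                  # 1 1 3 1  (length of each run)
r$values                   # 1 0 1 0  (value of each run)
r$lengths * r$values       # 1 0 3 0  -- the zero runs drop out
max(r$lengths * r$values)  # 3 consecutive overdraft days
```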

Upvotes: 2
