Prateek
Prateek

Reputation: 61

R dataframe uses values in current row from previous row

I have a data frame in R as defined below:

df <- data.frame('ID'=c(1,1,1,1),
                    'Month' =c('M1','M2','M3','M4'),
                 "Initial.Balance" =c(100,100,100,0),
                    "Value" = c(0.1,0.2,0.2,0.2),
                    "Threshold"=c(0.05,0.18,0.25,0.25),
                    "Intermediate.Balance"=c(0,0,100,0),
                    "Final.Balance"=c(100,100,0,0))

This task uses Initial.Balance (in current row) from the Final.Balance of the previous row.

  1. When Value >= Threshold, Intermediate.Balance=0 and Final.Balance = Initial.Balance-Intermediate.Balance
  2. When Value < Threshold, Intermediate.Balance = Initial.Balance and Final.Balance = Initial.Balance-Intermediate.Balance

I have tried to accomplish this task using for loop but it takes lot of time on large dataset (for many IDs)

Here is my solution:

for (i in 1:nrow(df)){
  df$Intermediate.Balance[i] <- ifelse(df$Value[i]>df$Threshold[i],0,df$Initial.balance[i])
  df$Final.Balance[i] <- df$Initial.balance[i]-df$Intermediate.Balance[i]
  if(i+1<=nrow(df)){
  df$Initial.balance[i+1] <- df$Final.Balance[i] }
}

Can we look for similar solution using Data Table? As data table operations are quicker than for loop on dataframe, I believe this will help me save computation time.

Thanks,

Upvotes: 0

Views: 83

Answers (2)

chinsoon12
chinsoon12

Reputation: 25225

I think in this particular case, final balance goes to 0 once there is a row with Value less than Threshold and subsequent balances all go to 0. So you can use this:

ib <- 100
df[, InitBal := ib * 0^shift(cumsum(Value<=Threshold), fill=0L)]
df[, ItmdBal := replace(rep(0, .N), which(Value<=Threshold)[1L], ib)]
df[, FinlBal := InitBal - ItmdBal]

or in one []:

df[, c("InitBal", "ItmdBal", "FinlBal") := {
    v <- Value<=Threshold
    InitBal <- ib * 0^shift(cumsum(v), fill=0L)
    ItmdBal <- replace(rep(0, .N), which(v)[1L], ib)
    .(InitBal, ItmdBal, InitBal - ItmdBal)
}]

Or a more general approach using Rcpp when the intermediate balance is not simply equal to the initial balance:

library(Rcpp)
cppFunction('List calc(NumericVector Value, NumericVector Threshold, double init) {
    int n = Value.size();
    NumericVector InitialBalance(n), IntermediateBalance(n), FinalBalance(n);

    InitialBalance[0] = init;
    for (int i=0; i<n; i++) {
        if (Value[i] <= Threshold[i]) {
            IntermediateBalance[i] = InitialBalance[i];
        } 
        FinalBalance[i] = InitialBalance[i] - IntermediateBalance[i];
        if (i < n-1) {
            InitialBalance[i+1] = FinalBalance[i];
        }
    }
   
    return List::create(Named("InitialBalance") = InitialBalance,
        Named("IntermediateBalance") = IntermediateBalance,
        Named("FinalBalance") = FinalBalance);
}')
setDT(df)[, calc(Value, Threshold, Initial.Balance[1L])]

Upvotes: 3

Dean MacGregor
Dean MacGregor

Reputation: 18426

I can't see an obvious way of getting rid of the loop since each row is deterministic into the next. That being said, data.frames copy the whole frame or at least whole columns whenever you set some portion of them. As such you can do this:

dt<-as.data.table(df)
for(i in 1:nrow(dt)) {
  dt[i,Intermediate.Balance:=ifelse(Value>Threshold,0,Initial.Balance)]
  dt[i,Final.Balance:=Initial.Balance-Intermediate.Balance]
  if(i+1<=nrow(dt)) dt[i+1,Initial.Balance:=dt[i,Final.Balance]]
}

You could also try the set function but I'm not sure if it'll be faster, or by how much, given that the data comes from the data.table anyway.

dt<-as.data.table(df)
for(i in 1:nrow(dt)) {
  i<-as.integer(i)
  set(dt,i,"Intermediate.Balance", ifelse(dt[i,Value]>dt[i,Threshold],0,dt[i,Initial.Balance]))
  set(dt,i,"Final.Balance", dt[i,Initial.Balance-Intermediate.Balance])
  if(i+1<=nrow(dt)) set(dt,i+1L,"Initial.Balance", dt[i,Final.Balance])
}

Upvotes: 0

Related Questions