Anflores
Anflores

Reputation: 119

R Programming: How to vectorize/speed up a for loop that need the previous value in each process

I'm doing a for loop to fill a vector. The problem is that in each loop it needs the previous value to keep doing calcs.

I'm using the package data.table, so the its a data table. R version 64 bits 3.2.3

The table has the f Im doing a for loop, but it takes time to run I would like to know if there is a way to vectorize or make this process faaster. I will explain what Im trying to achieve. First I have a table that as I have to use a loop for this part because I need the previous value so I cannot vectorize the operation.

The data table has the following structure:

NUMDCRED         FDES         Distancia      CURA   NPV
 0001        "2012-01-01"        11            0     1
 0001        "2012-02-01"        12            0     2
 0001        "2012-03-01"        13            1     2
 0001        "2011-01-01"        14            1     3
 0001        "2011-02-01"        15            1     3
 0001        "2011-03-01"        16            1     2 
 0001        "2011-04-01"        10            0     5
 0001        "2011-05-01"        11            0     4
 0001        "2011-06-01"        12            0     6 
 0001        "2011-07-01"        13            1     3
 0001        "2011-08-01"        14            1     2
 0001        "2011-09-01"        15            1     2
 0001        "2011-10-01"        16            1     1
 0001        "2011-11-01"        17            1     3
 0002        "2012-04-01"        11            0     6
 0002        "2012-05-01"        12            0     5
 0002        "2012-06-01"        13            1     4
 0002        "2012-07-01"        14            1     3
 0002        "2012-08-01"        15            1     3
 0002        "2012-09-01"        16            1     3
 0002        "2012-10-01"        10            0     3
 0002        "2012-11-01"        11            0     4
 0002        "2012-12-01"        12            0     4
 0002        "2013-01-01"        13            1     2
 0002        "2013-02-01"        14            1     2
 0002        "2013-03-01"        15            1     3
 0002        "2013-04-01"        16            1     3

The table is sorted (POBLACION_MOROSA6) by NUMDCRED and FDES (ascending order). What I need to do is to create other variable called P.Moroso, which value is set to one when the first different NUMDCRED appears, inscrease to P.Moroso + 1 when the condition NPV < 4 and Distancia > 12 and Cura[i-1] != 1 is reached. The value of P.Moroso must be keep it in each record until it changes when the condition is reached, with this I mean when the first NUMDCRED appears the value of P.Moroso is going to be 1 and also for the next record until it change to P.Moroso + 1 (2) when the condition is met and then this value would be keep it each record and so on.

The output of the process would be the following:

NUMDCRED         FDES         Distancia      CURA   NPV  P.Moroso
 0001        "2012-01-01"        11            0     1      1
 0001        "2012-02-01"        12            0     2      1
 0001        "2012-03-01"        13            1     2      2
 0001        "2011-01-01"        14            1     3      2
 0001        "2011-02-01"        15            1     3      2
 0001        "2011-03-01"        16            1     2      2
 0001        "2011-04-01"        10            0     5      2
 0001        "2011-05-01"        11            0     4      2
 0001        "2011-06-01"        12            0     6      2
 0001        "2011-07-01"        13            1     3      3
 0001        "2011-08-01"        14            1     2      3
 0001        "2011-09-01"        15            1     2      3
 0001        "2011-10-01"        16            1     1      3
 0001        "2011-11-01"        17            1     3      3
 0002        "2012-04-01"        11            0     6      1
 0002        "2012-05-01"        12            0     5      1
 0002        "2012-06-01"        13            1     4      2
 0002        "2012-07-01"        14            1     3      2
 0002        "2012-08-01"        15            1     3      2
 0002        "2012-09-01"        16            1     3      2
 0002        "2012-10-01"        10            0     3      2
 0002        "2012-11-01"        11            0     4      2
 0002        "2012-12-01"        12            0     4      2
 0002        "2013-01-01"        13            1     2      3
 0002        "2013-02-01"        14            1     2      3
 0002        "2013-03-01"        15            1     3      3
 0002        "2013-04-01"        16            1     3      3  

For the moment Im using the following simple foor loop to do this:

PERIODO_MOROSO <- vector(mode = "numeric",length=N3)
isFirstNumdCred_Morosa6 <- (1:N3) %in% FIRST_NUMDCRED_INDEX_P.MOROSA6

for(i in 1:N3){ 

   if(isFirstNumdCred_Morosa6[i]){

      P.MOROSO <- 1
   } else if(POBLACION_MOROSA6[i,NPV] < 4 & POBLACION_MOROSA6[i-1,CURA] ! =1   & POBLACION_MOROSA6[i,DISTANCIA_SALIDA] > 12){

     P.MOROSO <- P.MOROSO + 1
   }

   PERIODO_MOROSO[i] <- P.MOROSO
}

POBLACION_MOROSA6$P.MOROSO <- PERIODO_MOROSO 

The variable isFirstNumdCred_Morosa6 is a logical vector that indicates when the first different Numdcred Appears. My problem with the foor loop is that it is slow when working with large data (my tables have rows between 900k and 2 million. I tried using something with

ex[,date.seq.3:=ifelse( condition, shift(P.Moroso) +1 , P.Moroso)]

but it didn't work (first I assigned all the ones to the rows with the first different NUMDCRED)

Also I tried using other methods that I other people told me in this question I posted before, but I couldn't do it. I put the link of the other question if anyone want to see the solution to a similar problema I had.

So in conclusion I would like to know if it is posible to vectorize/speed up this process. R programming :How to speed up a loop that takes 2 hours and the reasons why it takes a lot

Upvotes: 1

Views: 210

Answers (3)

Michael Frasco
Michael Frasco

Reputation: 414

I think I have a fast solution, but I haven't tested it, so I don't really know. Here is my thought process:

  1. you can first split the data by the value of NUMDCRED, since the value of P.Moroso always starts at 1 each time that NUMDCRED changes. Put each subset of the data into a list.

  2. You can now apply a function using lapply to each dataset in the list. First, create a column that is TRUE if the condition that you specified is satisfied and FALSE if the condition is not specified. Then, you can take a cumulative sum of this column and store this as your P.Moroso column. I think that should be what you want.

  3. Merge all of the data sets back together.

Upvotes: -1

rawr
rawr

Reputation: 20811

You do not need loops

ex <- read.table(header = TRUE, text = 'NUMDCRED         FDES         Distancia      CURA   NPV  P.Moroso
 0001        "2012-01-01"        11            0     1      1
                 0001        "2012-02-01"        12            0     2      1
                 0001        "2012-03-01"        13            1     2      2
                 0001        "2011-01-01"        14            1     3      2
                 0001        "2011-02-01"        15            1     3      2
                 0001        "2011-03-01"        16            1     2      2
                 0001        "2011-04-01"        10            0     5      2
                 0001        "2011-05-01"        11            0     4      2
                 0001        "2011-06-01"        12            0     6      2
                 0001        "2011-07-01"        13            1     3      3
                 0001        "2011-08-01"        14            1     2      3
                 0001        "2011-09-01"        15            1     2      3
                 0001        "2011-10-01"        16            1     1      3
                 0001        "2011-11-01"        17            1     3      3
                 0002        "2012-04-01"        11            0     6      1
                 0002        "2012-05-01"        12            0     5      1
                 0002        "2012-06-01"        13            1     4      2
                 0002        "2012-07-01"        14            1     3      2
                 0002        "2012-08-01"        15            1     3      2
                 0002        "2012-09-01"        16            1     3      2
                 0002        "2012-10-01"        10            0     3      2
                 0002        "2012-11-01"        11            0     4      2
                 0002        "2012-12-01"        12            0     4      2
                 0002        "2013-01-01"        13            1     2      3
                 0002        "2013-02-01"        14            1     2      3
                 0002        "2013-03-01"        15            1     3      3
                 0002        "2013-04-01"        16            1     3      3  ')

In base, you can write your logic into a function

f <- function(data)
  cumsum(with(data, Distancia > 12 & NPV <= 4 & c(0, CURA[-length(CURA)]) != 1)) + 1L

and apply it to subsets of the data

ex$P.Moroso2 <- unlist(by(ex, dd$NUMDCRED, f))

identical(ex$P.Moroso, ex$P.Moroso2)
# [1] TRUE

Translated to data.table, this would look like

setDT(ex)[, P.Moroso3 := 
  cumsum(Distancia > 12 & NPV <= 4 & shift(CURA, fill = 0) != 1) + 1L
, by = NUMDCRED]
# or Frank says this works, anyways

Upvotes: 3

VR_1312
VR_1312

Reputation: 161

You mean something like this...? (suppose your table name is "TABLA")

P.moroso = c(1)
NUMDCRED = TABLA$NUMDCRED
Cura = TABLA$Cura
NPV = TABLA$NPV
Distancia = TABLA$Distancia   #right now, I just created vectors with the needed columns information

N = length (NUMEDRED)
contador = 1 #the counter set in 1
for (i in 2:N){
    if (NUMDCRED[i-1] != NUMDCRED[i])
       contador = 1  #sets contador in 1 again
    else if ((NVP[i] <4) && (Distancia[i] > 12)&& (Cura[i-1] != 1))
       contador = contador +1  #if the condition happens, increases contador in 1
    P.moroso[i] = contador #append contador in P.moroso vector.
}

Now, you should have a P.moroso vector with the numbers you want. Finally, you attach it to your table:

TABLA$P.moroso = P.moroso

Upvotes: 0

Related Questions