make a more efficient loop in data.table file

Question

I am trying to subset the current data.table with a loop. The process is extremely slow. Here is the data sample before subsetting.

 #      V1  V2     V3     type
 #1:    1 362.25 361.75    A
 #2:    1 362.25 361.75    B
 #3:    3 362.25 361.75    C
 #4:    4 362.75 362.00    C
 #5:    8 362.50 362.00    A
 #6:    6 362.50 362.00    B
 #7:   12 362.50 362.89    A
 #8:    8 362.25 362.05    B
 #9:    9 362.25 362.00    A
 #10:  17 362.25 362.20    B
 #11:  29 362.25 362.90    C
 #12:  41 362.25 362.40    C
 #13:  99 362.25 362.10    C
 #14:  81 362.25 362.00    C

I want to subset the data based on the variable "type". I want to only keep all rows with (data$type =="c"), also I need the two rows of (data$type =="A") and (data$type =="B") only if they are followed by a (data$type="C") row. After subsetting, the data should look like:

 #      V1  V2     V3     type
 #1:    1 362.25 361.75    A
 #2:    1 362.25 361.75    B
 #3:    3 362.25 361.75    C
 #4:    4 362.75 362.00    C
 #9:    9 362.25 362.00    A
 #10:  17 362.25 362.20    B
 #11:  29 362.25 362.90    C
 #12:  41 362.25 362.40    C
 #13:  99 362.25 362.10    C
 #14:  81 362.25 362.00    C

If (data$type == "C") then that row needs to keep. Row #1,#2,#9,#10 are also kept because they are followed by "type==C" rows.

I am using a loop now to do it, but it is extremely slow.

    data$temp<-"omit"
  for (j in 3:nrow(data)){
    if (data$type[j] == "C" && data$type[j-1] == "B"
        &&  data$type[j-2] == "A" )
    {
      data$temp[j] <- "pair" ; data$temp[j-1] <- "pair"; data$temp[j-2] <- "pair"
    }
  }



  for (j in 2:nrow(data)){
    if (data$type[j-1] == "C" && data$type[j] == "C"
        && data$temp[j-1] == "pair" && data$temp[j]== "omit")
    {
      nearby$temp[j] <- "pair"
    }
  }  

data<-data[!(data$temp=="omit"),]

This code works fine, but just too slow. Please give me some ideas to improve the efficiency but does the same job.

Thank you so much

Jaap · Accepted Answer

There is no need for a for-loop in this case. With the use of the shift-function of data.table, you can subset your data as follows (assuming that A and B are always in the presented order):

DT[type=='C' | (type=='A' & shift(type, 2, NA, 'lead')=='C') | (type=='B' & shift(type, 1, NA, 'lead')=='C')]

which gives:

    V1     V2     V3 type
 1:  1 362.25 361.75    A
 2:  1 362.25 361.75    B
 3:  3 362.25 361.75    C
 4:  4 362.75 362.00    C
 5:  9 362.25 362.00    A
 6: 17 362.25 362.20    B
 7: 29 362.25 362.90    C
 8: 41 362.25 362.40    C
 9: 99 362.25 362.10    C
10: 81 362.25 362.00    C

make a more efficient loop in data.table file

Answers (2)

Related Questions