jester
jester

Reputation: 309

make a more efficient loop in data.table file

I am trying to subset the current data.table with a loop. The process is extremely slow. Here is the data sample before subsetting.

 #      V1  V2     V3     type
 #1:    1 362.25 361.75    A
 #2:    1 362.25 361.75    B
 #3:    3 362.25 361.75    C
 #4:    4 362.75 362.00    C
 #5:    8 362.50 362.00    A
 #6:    6 362.50 362.00    B
 #7:   12 362.50 362.89    A
 #8:    8 362.25 362.05    B
 #9:    9 362.25 362.00    A
 #10:  17 362.25 362.20    B
 #11:  29 362.25 362.90    C
 #12:  41 362.25 362.40    C
 #13:  99 362.25 362.10    C
 #14:  81 362.25 362.00    C

I want to subset the data based on the variable "type". I want to only keep all rows with (data$type =="c"), also I need the two rows of (data$type =="A") and (data$type =="B") only if they are followed by a (data$type="C") row. After subsetting, the data should look like:

 #      V1  V2     V3     type
 #1:    1 362.25 361.75    A
 #2:    1 362.25 361.75    B
 #3:    3 362.25 361.75    C
 #4:    4 362.75 362.00    C
 #9:    9 362.25 362.00    A
 #10:  17 362.25 362.20    B
 #11:  29 362.25 362.90    C
 #12:  41 362.25 362.40    C
 #13:  99 362.25 362.10    C
 #14:  81 362.25 362.00    C

If (data$type == "C") then that row needs to keep. Row #1,#2,#9,#10 are also kept because they are followed by "type==C" rows.

I am using a loop now to do it, but it is extremely slow.

    data$temp<-"omit"
  for (j in 3:nrow(data)){
    if (data$type[j] == "C" && data$type[j-1] == "B"
        &&  data$type[j-2] == "A" )
    {
      data$temp[j] <- "pair" ; data$temp[j-1] <- "pair"; data$temp[j-2] <- "pair"
    }
  }



  for (j in 2:nrow(data)){
    if (data$type[j-1] == "C" && data$type[j] == "C"
        && data$temp[j-1] == "pair" && data$temp[j]== "omit")
    {
      nearby$temp[j] <- "pair"
    }
  }  

data<-data[!(data$temp=="omit"),]

This code works fine, but just too slow. Please give me some ideas to improve the efficiency but does the same job.

Thank you so much

Upvotes: 1

Views: 70

Answers (2)

Jaap
Jaap

Reputation: 83275

There is no need for a for-loop in this case. With the use of the shift-function of data.table, you can subset your data as follows (assuming that A and B are always in the presented order):

DT[type=='C' | (type=='A' & shift(type, 2, NA, 'lead')=='C') | (type=='B' & shift(type, 1, NA, 'lead')=='C')]

which gives:

    V1     V2     V3 type
 1:  1 362.25 361.75    A
 2:  1 362.25 361.75    B
 3:  3 362.25 361.75    C
 4:  4 362.75 362.00    C
 5:  9 362.25 362.00    A
 6: 17 362.25 362.20    B
 7: 29 362.25 362.90    C
 8: 41 362.25 362.40    C
 9: 99 362.25 362.10    C
10: 81 362.25 362.00    C

Upvotes: 3

R. Schifini
R. Schifini

Reputation: 9313

You can use which to obtain the index of the rows that have a "C". Then include indexes 1 and 2 numbers less than those found.

For example:

df = data.frame(d = c(1,2,3,4,5,6,7,8,9,0),t = c("A","B","C","C","A","B","A","B","C","C"))
> df
   d t
1  1 A
2  2 B
3  3 C
4  4 C
5  5 A
6  6 B
7  7 A
8  8 B
9  9 C
10 0 C

c(which(df$t=="C") will return:

[1]  3  4  9 10

but you also want to include rows 1, 2, 7 and 8. Then do:

df[sort(unique(c(which(df$t=="C"),which(df$t=="C")-1,which(df$t=="C")-2))),]

   d t
1  1 A
2  2 B
3  3 C
4  4 C
7  7 A
8  8 B
9  9 C
10 0 C

The sort and unique will remove repeats and the sort will set all index in order.

Note: I'm assuming that there are no sequences like CAC or CBC. This will also include sequences like BACC (doesn't check the order of A and B).

Upvotes: 1

Related Questions