Daniel Cates
Daniel Cates

Reputation: 21

R - Making loops faster

This little code snippet is supposed to loop through a sorted data frame. It keeps a count of how many successive rows have the same information in columns aIndex and cIndex and also bIndex and dIndex. If these are the same, it deposits the count and increments it for the next time around, and if they differ, it deposits the count and resets it to 1 for the next time around.

for (i in 1:nrow(myFrame)) {
  if (myFrame[i, aIndex] == myFrame[i, cIndex] &
    myFrame[i, bIndex] == myFrame[i, dIndex]) {
      myFrame[i, eIndex] <- count
      count <- (count + 1)
  } else {
      myFrame[i, eIndex] <- count
      count <- 1
  }
}

It's been running for a long time now. I understand that I'm supposed to vectorize whenever possible, but I'm not really seeing it here. What am I supposed to do to make this faster?

Here's what an example few rows should look like after running:

aIndex bIndex cIndex dIndex eIndex
     1      2      1      2      1
     1      2      1      2      2
     1      2      4      8      3
     4      8      1      4      1
     1      4      1      4      1

Upvotes: 1

Views: 165

Answers (2)

Pierre Lapointe
Pierre Lapointe

Reputation: 16277

Maybe this will work. I have reworked the rle and sequence bits.

dat <- read.table(text="aIndex bIndex cIndex dIndex
1 2 1 2
1 2 1 2
1 2 4 8
4 8 1 4
1 4 1 4", header=TRUE, as.is=TRUE,sep = " ")
dat$eIndex <-NA
#identify rows where a=c and b=d, multiply by 1 to get a numeric vector
dat$id<-(dat$aIndex==dat$cIndex & dat$bIndex==dat$dIndex)*1
#identify sequence
runs <- rle(dat$id)
#create sequence, multiply by id to keep only identicals, +1 at the end
count <-sequence(runs$lengths)*dat$id+1
#shift sequence down one notch, start with 1
dat$eIndex <-c(1,count[-length(count)])
dat

  aIndex bIndex cIndex dIndex eIndex id
1      1      2      1      2      1  1
2      1      2      1      2      2  1
3      1      2      4      8      3  0
4      4      8      1      4      1  0
5      1      4      1      4      1  1

Upvotes: 2

Brian Diggs
Brian Diggs

Reputation: 58825

I think this will do what you want; the tricky part is that the count resets after the difference, which effectively puts a shift on the eIndex.

There (hopefully) is an easier way to do this, but this is what I came up with.

tmprle <- rle(((myFrame$aIndex == myFrame$cIndex) & 
               (myFrame$bIndex == myFrame$dIndex)))
myFrame$eIndex <- c(1,
                    unlist(ifelse(tmprle$values, 
                                  Vectorize(seq.default)(from = 2,
                                                         length = tmprle$lengths), 
                                  lapply(tmprle$lengths, 
                                         function(x) {rep(1, each = x)})))
                    )[-(nrow(myFrame)+1)]

which gives

> myFrame
  aIndex bIndex cIndex dIndex eIndex
1      1      2      1      2      1
2      1      2      1      2      2
3      1      2      4      8      3
4      4      8      1      4      1
5      1      4      1      4      1

Upvotes: 2

Related Questions