phildeutsch
phildeutsch

Reputation: 683

Speeding up nested loops in R

I have a large datframe (about 3 million rows) which contain an ID, a year and three dates each: lookupdate, date1 and date2. The data.frame is sorted by ID and date1. I want to search through the whole data set and find the records i which:

I have implemented this logic below, but it is ridiculously slow. Do you have any idea how to speed up this code?

calc_hits_bruteforce <- function(d){
  N <- nrow(d)
  hits <- rep(FALSE, N)
  for (i in 2:N) {
    if(d[i,"financial_year"]!=2013) next
    for (j in i:1) {
      if (d[i,"ID"]!=d[j,"ID"]) {
        break
      }
      else {
        if (d[j,"date1"] < d[i,"lookupdate"] & d[j, "date2"] > d[i, "lookupdate"]) {
          hits[i] <- TRUE
          break
        }
      }
    }
  }
  hits
}

I don't know how many records there are for each ID, but I know the lookupdate for each record lies before date1 and date2, i.e. lookupdate[i] < date1[i] < date2[i] for all i.

Here's an example of the dataframe and the output:

> d.ex
    ID     lookupdate      date1      date2 financial_year
1 C143896B 2011-02-24 2011-11-09 2011-11-21           2011
2 C143896G 2010-11-23 2011-10-29 2011-11-21           2011
3 C143896G 2011-11-11 2012-10-12 2012-11-05           2012
4 C143896G 2012-06-17 2013-01-30 2013-02-11           2013
5 C143896G 2012-10-31 2013-09-15 2013-09-29           2013
> calc_hits_bruteforce(d.ex)
[1] FALSE FALSE FALSE FALSE  TRUE

The last row is TRUE since 2012-10-12 < 2012-10-31 < 2012-11-05.

Upvotes: 1

Views: 1697

Answers (2)

Martin Morgan
Martin Morgan

Reputation: 46886

From the way you pose the question, it sounds like you're interested in a logical vector of length equal to the number of rows in d, so pre-allocate that

hits = logical(nrow(d))  ## initialized to 'FALSE'

You're interested in the subset of rows from a particular financial year, so vectorize the selection

i_idx <- which(d$financial_year == 2013)

For each of these you'll update hits to be true if any other row satisfies some complicated condition; it's not obvious how to avoid the outer loop (although specific features of your data (e.g., only a few IDs) might suggest a different strategy), but the inner loop can be vectorized as

for (i in i_idx)
    hits[i] <- any(d[, date1] < d[i, lookupdate] &
                   d[, date2] > d[i, lookupdate] &
                   d[, ID] == d[i, ID] &
                   seq_len(nrow(d)) < i)
}

So combined and with a little optimization

calc_hits_bruteforce <- function(d) {
    hits <- logical(nrow(d))
    i_idx <- which(d$financial_year == 2013)
    for (i in i_idx) {
        lkup <- d[i, lookupdate]
        hits[i] <- any((d$date1 < lkup) & (d$date2 > lkup) &
                       (d$ID == d[i, ID]) & (seq_len(nrow(d)) < i))

    }
    hits
}

This will be faster than your original, but does not exploit the sorted nature of your data and will scale approximately with the number of rows in your data frame (instead of with the square of the number of rows, as in your original algorithm).

One possible improvement is to use the Bioconductor IRanges package. Install and attach with

source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
library(IRanges)

IRanges are integer-valued, so representation of dates becomes important. I read your data as

txt <- "ID     lookupdate      date1      date2 financial_year
C143896B 2011-02-24 2011-11-09 2011-11-21           2011
C143896G 2010-11-23 2011-10-29 2011-11-21           2011
C143896G 2011-11-11 2012-10-12 2012-11-05           2012
C143896G 2012-06-17 2013-01-30 2013-02-11           2013
C143896G 2012-10-31 2013-09-15 2013-09-29           2013"

d <- read.delim(textConnection(txt),
                colClasses=c("factor", "Date", "Date", "Date", "integer"),
                sep="")

Then represent the dates and lookup as IRanges (the range representation includes end points, but you're not interested in that).

dates = with(d, IRanges(as.integer(date1) + 1, as.integer(date2) - 1))
lkup = with(d, IRanges(as.integer(lookupdate), width=1))

Find overlapping ranges (this finds all overlapping ranges; we weed out the unwanted values later; the comparison is efficient, as described on the help page ?IntervalTree)

olaps = findOverlaps(query=dates, subject=lkup)

and fine-tune

q_hits = queryHits(olaps); s_hits = subjectHits(olaps)
keep = (d[s_hits, "financial_year"] == 2013) &
    (d[s_hits, "ID"] == d[q_hits, "ID"]) & (q_hits < s_hits)
tabulate(s_hits[keep], length(lkup)) != 0

This will be fast, though I might have got the edge cases wrong.

Upvotes: 2

DJJ
DJJ

Reputation: 2549

test <- structure(list(ID = c("C143896B", "C143896G", "C143896G", "C143896G", 
"C143896G"), lookupdate = structure(c(15029, 14936, 15289, 15508, 
15644), class = "Date"), date1 = structure(c(15287, 15276, 15625, 
15735, 15963), class = "Date"), date2 = structure(c(15299, 15299, 
15649, 15747, 15977), class = "Date"), financial_year = c(2011, 
2011, 2012, 2013, 2013)), .Names = c("ID", "lookupdate", "date1", 
"date2", "financial_year"), row.names = c(NA, -5L), class = "data.frame")

I would suggest this but I'm afraid I could not test its performance:

calc_hits_bruteforce2 <- function(db){
 a <- sapply(test[,2],FUN=function(x)(test[,3] < x & x < test[,4] ))
 b <- sapply(test[,1],FUN=function(x)(x==test[,1]))
 c <- matrix(sapply(test[,5], FUN=function(x)(x==2013)),nrow(a),nrow(a), byrow=T)
 d <- a==TRUE & a==b & a==c
 rows <- round(which(d==TRUE)/nrow(a))
 test[rows,]
}


##         ID lookupdate      date1      date2 financial_year
## 5 C143896G 2012-10-31 2013-09-15 2013-09-29           2013

Upvotes: 0

Related Questions