Reputation: 683
I have a large datframe (about 3 million rows) which contain an ID, a year and three dates each: lookupdate
, date1
and date2
. The data.frame
is sorted by ID
and date1
. I want to search through the whole data set and find the records i
which:
financial_year == 2013
andID
as any other line j
such that date1[j] < lookupdate[i] < date2[j]
I have implemented this logic below, but it is ridiculously slow. Do you have any idea how to speed up this code?
calc_hits_bruteforce <- function(d){
N <- nrow(d)
hits <- rep(FALSE, N)
for (i in 2:N) {
if(d[i,"financial_year"]!=2013) next
for (j in i:1) {
if (d[i,"ID"]!=d[j,"ID"]) {
break
}
else {
if (d[j,"date1"] < d[i,"lookupdate"] & d[j, "date2"] > d[i, "lookupdate"]) {
hits[i] <- TRUE
break
}
}
}
}
hits
}
I don't know how many records there are for each ID
, but I know the lookupdate
for each record lies before date1
and date2
, i.e. lookupdate[i] < date1[i] < date2[i]
for all i
.
Here's an example of the dataframe and the output:
> d.ex
ID lookupdate date1 date2 financial_year
1 C143896B 2011-02-24 2011-11-09 2011-11-21 2011
2 C143896G 2010-11-23 2011-10-29 2011-11-21 2011
3 C143896G 2011-11-11 2012-10-12 2012-11-05 2012
4 C143896G 2012-06-17 2013-01-30 2013-02-11 2013
5 C143896G 2012-10-31 2013-09-15 2013-09-29 2013
> calc_hits_bruteforce(d.ex)
[1] FALSE FALSE FALSE FALSE TRUE
The last row is TRUE since 2012-10-12 < 2012-10-31 < 2012-11-05.
Upvotes: 1
Views: 1697
Reputation: 46886
From the way you pose the question, it sounds like you're interested in a logical vector of length equal to the number of rows in d
, so pre-allocate that
hits = logical(nrow(d)) ## initialized to 'FALSE'
You're interested in the subset of rows from a particular financial year, so vectorize the selection
i_idx <- which(d$financial_year == 2013)
For each of these you'll update hits
to be true if any other row satisfies some complicated condition; it's not obvious how to avoid the outer loop (although specific features of your data (e.g., only a few IDs) might suggest a different strategy), but the inner loop can be vectorized as
for (i in i_idx)
hits[i] <- any(d[, date1] < d[i, lookupdate] &
d[, date2] > d[i, lookupdate] &
d[, ID] == d[i, ID] &
seq_len(nrow(d)) < i)
}
So combined and with a little optimization
calc_hits_bruteforce <- function(d) {
hits <- logical(nrow(d))
i_idx <- which(d$financial_year == 2013)
for (i in i_idx) {
lkup <- d[i, lookupdate]
hits[i] <- any((d$date1 < lkup) & (d$date2 > lkup) &
(d$ID == d[i, ID]) & (seq_len(nrow(d)) < i))
}
hits
}
This will be faster than your original, but does not exploit the sorted nature of your data and will scale approximately with the number of rows in your data frame (instead of with the square of the number of rows, as in your original algorithm).
One possible improvement is to use the Bioconductor IRanges package. Install and attach with
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
library(IRanges)
IRanges are integer-valued, so representation of dates becomes important. I read your data as
txt <- "ID lookupdate date1 date2 financial_year
C143896B 2011-02-24 2011-11-09 2011-11-21 2011
C143896G 2010-11-23 2011-10-29 2011-11-21 2011
C143896G 2011-11-11 2012-10-12 2012-11-05 2012
C143896G 2012-06-17 2013-01-30 2013-02-11 2013
C143896G 2012-10-31 2013-09-15 2013-09-29 2013"
d <- read.delim(textConnection(txt),
colClasses=c("factor", "Date", "Date", "Date", "integer"),
sep="")
Then represent the dates and lookup as IRanges (the range representation includes end points, but you're not interested in that).
dates = with(d, IRanges(as.integer(date1) + 1, as.integer(date2) - 1))
lkup = with(d, IRanges(as.integer(lookupdate), width=1))
Find overlapping ranges (this finds all overlapping ranges; we weed out the unwanted values later; the comparison is efficient, as described on the help page ?IntervalTree)
olaps = findOverlaps(query=dates, subject=lkup)
and fine-tune
q_hits = queryHits(olaps); s_hits = subjectHits(olaps)
keep = (d[s_hits, "financial_year"] == 2013) &
(d[s_hits, "ID"] == d[q_hits, "ID"]) & (q_hits < s_hits)
tabulate(s_hits[keep], length(lkup)) != 0
This will be fast, though I might have got the edge cases wrong.
Upvotes: 2
Reputation: 2549
test <- structure(list(ID = c("C143896B", "C143896G", "C143896G", "C143896G",
"C143896G"), lookupdate = structure(c(15029, 14936, 15289, 15508,
15644), class = "Date"), date1 = structure(c(15287, 15276, 15625,
15735, 15963), class = "Date"), date2 = structure(c(15299, 15299,
15649, 15747, 15977), class = "Date"), financial_year = c(2011,
2011, 2012, 2013, 2013)), .Names = c("ID", "lookupdate", "date1",
"date2", "financial_year"), row.names = c(NA, -5L), class = "data.frame")
I would suggest this but I'm afraid I could not test its performance:
calc_hits_bruteforce2 <- function(db){
a <- sapply(test[,2],FUN=function(x)(test[,3] < x & x < test[,4] ))
b <- sapply(test[,1],FUN=function(x)(x==test[,1]))
c <- matrix(sapply(test[,5], FUN=function(x)(x==2013)),nrow(a),nrow(a), byrow=T)
d <- a==TRUE & a==b & a==c
rows <- round(which(d==TRUE)/nrow(a))
test[rows,]
}
## ID lookupdate date1 date2 financial_year
## 5 C143896G 2012-10-31 2013-09-15 2013-09-29 2013
Upvotes: 0