Reputation: 153
I have two very large data frames (50MM+ rows) and I need to run some calculations on them. I have developed the following loop, but it runs too slowly. I tried using apply and other methods, but I couldn't get them to work.
#### Sample Data
df=data.frame(id=1:10,time=Sys.time()-1:10,within5=NA)
df2=data.frame(id2=c(1,1,1,5,5,10),time2=Sys.time()-c(9,5,2,3,4,6))
#### Loop shows how many results from df2 are within 5 secs of the creation of the ID in df
for (i in seq_along(df$id))
{
  # rows of df2 for the current id (column is id2; df2$id only worked via partial matching)
  temp=df2[df2$id2==df$id[i],]
  df$within5[i]=sum(abs(as.numeric(difftime(temp$time2,df$time[i],units="secs")))<5)
}
Upvotes: 1
Views: 175
Reputation: 46856
Use the id in df2 to look up the reference time in df, and subtract that from the event time. For your data above:
dt <- df2$time2 - df$time[df2$id2]
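Then select the event ids with absolute time differences of less than 5 seconds (the units are given explicitly, because subtracting date-times picks the units automatically):
okIds <- df2$id2[abs(as.numeric(dt, units = "secs")) < 5]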
Tabulate these and add the counts to your original data frame:
df$within5 <- tabulate(okIds, max(df$id))
This relies on the ids being sequential integers (if they are not, convert them to a factor() and use the resulting integer codes instead) and is very fast.
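For ids that are not sequential integers, the factor approach mentioned above might look like the following minimal sketch. The sample ids (3, 17, 42), the fixed base time, and the names codes and ok are made up for illustration and are not part of the original answer.

```r
# Hypothetical sample data with non-sequential ids (assumption, for illustration)
base <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
df  <- data.frame(id = c(3, 17, 42), time = base - c(1, 2, 3))
df2 <- data.frame(id2 = c(3, 3, 42, 42, 42), time2 = base - c(2, 9, 3, 5, 20))

# Encode df2's ids as a factor whose levels are df's ids, so the integer
# codes are row positions in df (NA for ids that do not occur in df)
codes <- as.integer(factor(df2$id2, levels = df$id))

# Time difference in seconds between each event and its reference time
dt <- as.numeric(difftime(df2$time2, df$time[codes], units = "secs"))

# Count events within 5 seconds per id; ids with no such events get 0
ok <- !is.na(codes) & abs(dt) < 5
df$within5 <- tabulate(codes[ok], nbins = nrow(df))
```

Using df's ids as the factor levels keeps the counts aligned with the rows of df, whatever order the ids appear in df2.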
Upvotes: 1
Reputation: 98439
To measure the improvement, I made a larger sample data set.
df=data.frame(id=1:100,time=Sys.time()-1:100)
df2=data.frame(id2=sample(1:100,300000,replace=TRUE),time2=Sys.time()-sample(1:5,300000,replace=TRUE))
Use the function ddply() from the package plyr to divide your data according to the column id2. Then apply your function to each subset.
library(plyr)
df3 <- ddply(df2,"id2",function(x){
  # x holds the rows of df2 for one id2; compare them with that id's time in df
  data.frame(within5=sum(abs(as.numeric(difftime(x$time2,df$time[df$id==x$id2[1]],units="secs")))<5))
})
As a result, we get a new data frame.
head(df3)
  id2 within5
1   1    3129
2   2    3032
3   3    2935
4   4    3121
5   5    3042
6   6    2426
If you need the column within5 in your original data frame, you can use the function merge().
df4 <- merge(df,df3,by.x="id",by.y="id2",all=TRUE)
With my sample data, this calculation was 10 times faster.
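One thing to watch for with the merge() step: because it is an outer join, any id in df that never appears in df2 ends up with NA in within5 rather than 0. A minimal sketch of filling those in, using small made-up data frames as stand-ins for df and the ddply() result df3:

```r
# Made-up stand-ins for df and df3 (assumption, for illustration only)
df  <- data.frame(id = 1:4)
df3 <- data.frame(id2 = c(1, 3), within5 = c(2L, 5L))

# Outer join keeps every id; ids missing from df3 get NA in within5
df4 <- merge(df, df3, by.x = "id", by.y = "id2", all = TRUE)

# Treat "no matching rows in df2" as a count of zero
df4$within5[is.na(df4$within5)] <- 0L
```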
Upvotes: 3