Reputation: 209
I have a data.table object which contains time-stamps (measured as seconds after midnight). My aim is to run a function which returns, for each row, the number of observations that occurred at most k seconds before that observation.
library(data.table)
library(dplyr)
library(dtplyr)
set.seed(123)
DF <- data.frame(Secs=cumsum(rexp(10000,1)))
setDT(DF)
> DF
Secs
1: 8.434573e-01
2: 1.420068e+00
3: 2.749122e+00
4: 2.780700e+00
5: 2.836911e+00
---
9996: 1.003014e+04
9997: 1.003382e+04
9998: 1.003384e+04
9999: 1.003414e+04
10000: 1.003781e+04
The function I want to apply to each row is
nS <- function(Second, k = 5) {
  max(1, nrow(DF %>% filter(Secs < Second & Secs >= Second - k)))
}
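For a single timestamp this just counts the rows in the preceding window; with the data shown above, the fifth observation (Secs ≈ 2.84) has four earlier observations within the last 5 seconds:
nS(DF$Secs[5])
# should give [1] 4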
One way to obtain what I want is to use apply, which takes rather long.
system.time(val <- apply(DF,1,nS))
user  system elapsed
20.56 0.03 20.66
# Not working
DF %>% mutate(nS = nS(Secs, 100)) %>% head()
# Also not working
library(lazyeval)
f <- function(col1, new_col_name) {
  mutate_call <- lazyeval::interp(~ nS(a), a = as.name(col1))
  DF %>% mutate_(.dots = setNames(list(mutate_call), new_col_name))
}
head(f('Secs', 'nS'))
DF %>% mutate(minTime = Secs - k) %>% head()
Is it possible to achieve this with mutate? Thanks a lot for your help!
Upvotes: 3
Views: 14579
Reputation: 1713
If you are OK with not using dplyr at all, the following is very fast:
applyNS <- function(s, k = 5) {
  cnt <- numeric(length(s))
  for (i in 1:length(s)) {
    # at lag i, check which observations are within k seconds of the
    # observation i positions earlier
    res <- (s[(1 + i):length(s)] - s[1:(length(s) - i)]) <= k
    cnt[(1 + i):length(s)] <- cnt[(1 + i):length(s)] + res
    # since s is sorted, once no pair at lag i is within k seconds,
    # no larger lag can be either, so stop
    if (!any(res)) break
  }
  cnt
}
The function assumes that s is sorted in ascending order.
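Here that is already the case, because Secs is a cumulative sum of positive increments; if your timestamps were not ordered, you could sort them first, for example:
result1 <- applyNS(sort(DF$Secs))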
The result of this function is slightly different: your code gives a count of one even if the difference to the previous timestamp is already larger than k. But this is easily adjusted, and then the results are the same:
DF <- data.frame(Secs = cumsum(rexp(10000, 1)))
nS <- function(Second, k = 5) {
  max(1, nrow(DF %>% filter(Secs < Second & Secs >= Second - k)))
}
result <- apply(DF, 1, nS)
result1 <- applyNS(DF$Secs)
result1[result1 == 0] <- 1
print(all(result - result1 == 0))
This prints out [1] TRUE. Notice that this implementation is much faster:
> system.time(apply(DF, 1, nS))
user  system elapsed
8.31 0.00 8.43
> system.time(replicate(100,{result1 <- applyNS(DF$Secs); result1[result1 == 0] <- 1}))/100
user  system elapsed
0.0071 0.0000 0.0073
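If you still want the count as a column of DF via mutate, one way (a sketch, reusing applyNS and the same adjustment as above) would be:
DF %>% mutate(nS = pmax(1, applyNS(Secs)))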
Upvotes: 2
Reputation: 10671
Does using rowwise() work for you?
DF %>% rowwise() %>% mutate(ns = nS(Secs),        # default k = 5, equal to your apply
                            ns2 = nS(Secs, 100))  # second test case, k = 100
Source: local data frame [10,000 x 3]
Groups: <by row>
# A tibble: 10,000 × 3
Secs ns ns2
<dbl> <dbl> <dbl>
1 0.1757671 1 1
2 1.1956531 1 1
3 1.6594676 2 2
4 2.6988685 3 3
5 2.8845783 4 4
6 3.1012975 5 5
7 4.1258548 6 6
8 4.1584318 7 7
9 4.2346702 8 8
10 6.0375495 8 9
# ... with 9,990 more rows
It's only slightly faster than apply on my machine...
system.time(DF %>% rowwise() %>% mutate(ns = nS(Secs)))
user system elapsed
13.934 1.060 15.280
system.time(apply(DF, 1, nS))
user system elapsed
14.938 1.101 16.438
Upvotes: 5