Stefan Voigt

Reputation: 209

Apply function within mutate

I have a data.table object which contains timestamps (measured as seconds after midnight). My aim is to run a function which returns, for each row, the number of observations that occurred at most k seconds before that observation.

library(data.table)
library(dplyr)
library(dtplyr)
set.seed(123)
DF  <- data.frame(Secs=cumsum(rexp(10000,1)))
setDT(DF)  
> DF
               Secs
1: 8.434573e-01
2: 1.420068e+00
3: 2.749122e+00
4: 2.780700e+00
5: 2.836911e+00
---             
 9996: 1.003014e+04
 9997: 1.003382e+04
 9998: 1.003384e+04
 9999: 1.003414e+04
10000: 1.003781e+04

The function I want to apply to each row is

nS<-function(Second,k=5) 
    max(1,nrow(DF%>%filter(Secs<Second & Secs>=Second-k)))

One way to get what I want is to use apply, but it is rather slow.

 system.time(val <- apply(DF,1,nS))
   user      system     elapsed 
  20.56        0.03       20.66 

# Not working
DF%>%mutate(nS=nS(Secs,100))%>%head()

# Also not working
library(lazyeval)
f = function(col1, new_col_name) {
    mutate_call = lazyeval::interp(~ nS(a), a = as.name(col1))
    DF%>%mutate_(.dots=setNames(list(mutate_call),new_col_name))
}
head(f('Secs', 'nS'))

DF%>%mutate(minTime=Secs-k)%>%head()

Is it possible to do this with mutate? Thanks a lot for your help!

Upvotes: 3

Views: 14579

Answers (2)

Jonathan von Schroeder

Reputation: 1713

If you are OK with not using dplyr at all, the following is very fast:

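applyNS <- function(s, k = 5) {
  n <- length(s)
  cnt <- numeric(n)
  # For lag i, compare each element with the one i positions earlier
  for (i in seq_len(max(n - 1, 0))) {
    res <- (s[(1 + i):n] - s[1:(n - i)]) <= k
    cnt[(1 + i):n] <- cnt[(1 + i):n] + res
    # s is sorted, so once no pair at lag i is within k, larger lags cannot be either
    if (!any(res)) break
  }
  cnt
}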

The function assumes that s is sorted in ascending order.

The result of this function differs slightly from yours: your code returns a count of one even when the gap to the previous timestamp is already larger than k. This is easily adjusted, after which the results are the same:

DF  <- data.frame(Secs=cumsum(rexp(10000,1)))
nS<-function(Second,k=5) 
  max(1,nrow(DF%>%filter(Secs<Second & Secs>=Second-k)))
result <- apply(DF,1,nS)
result1 <- applyNS(DF$Secs)
result1[result1 == 0] <- 1
print(all(result - result1 == 0))

This prints '[1] TRUE'. Note that this implementation is much faster:

> system.time(apply(DF, 1, nS))
       user      system     elapsed 
       8.31        0.00        8.43 
> system.time(replicate(100,{result1 <- applyNS(DF$Secs); result1[result1 == 0] <- 1}))/100
       user      system     elapsed 
     0.0071      0.0000      0.0073 
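
As a further alternative to the loop (not part of the answer above), the same window count can probably be obtained without any explicit iteration by using base R's findInterval. This is only a sketch: the helper name count_window is made up for illustration, and it assumes, like applyNS, that the timestamps are sorted in ascending order. It uses the question's window definition, i.e. Secs < Second & Secs >= Second - k.

```r
# Hypothetical helper (name chosen for illustration only).
# For each timestamp s[i], counts the elements of s in the half-open
# window [s[i] - k, s[i]). Assumes s is sorted in ascending order.
count_window <- function(s, k = 5) {
  n_before  <- findInterval(s,     s, left.open = TRUE)  # elements strictly < s[i]
  n_too_old <- findInterval(s - k, s, left.open = TRUE)  # elements strictly < s[i] - k
  pmax(1, n_before - n_too_old)  # mimic the max(1, ...) in the question's nS
}

set.seed(123)
secs <- cumsum(rexp(1000, 1))
fast <- count_window(secs, k = 5)

# Brute-force reference using the same window definition
slow <- vapply(secs,
               function(x) max(1, sum(secs < x & secs >= x - 5)),
               numeric(1))
all(fast == slow)
```

With left.open = TRUE, findInterval(x, s) returns the number of elements of s that are strictly smaller than x, so the difference of the two calls is the number of observations inside the window.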

Upvotes: 2

Nate

Reputation: 10671

Does using rowwise() work for you?

DF %>% rowwise() %>% mutate(ns = nS(Secs), # default k = 5, equal to your apply
                            ns2 = nS(Secs, 100)) # second test case k = 100
Source: local data frame [10,000 x 3]
Groups: <by row>

# A tibble: 10,000 × 3
        Secs    ns   ns2
       <dbl> <dbl> <dbl>
1  0.1757671     1     1
2  1.1956531     1     1
3  1.6594676     2     2
4  2.6988685     3     3
5  2.8845783     4     4
6  3.1012975     5     5
7  4.1258548     6     6
8  4.1584318     7     7
9  4.2346702     8     8
10 6.0375495     8     9
# ... with 9,990 more rows

It's only slightly faster than apply on my machine:

system.time(DF %>% rowwise() %>% mutate(ns = nS(Secs)))
   user  system elapsed 
 13.934   1.060  15.280 

system.time(apply(DF, 1, nS))
   user  system elapsed 
 14.938   1.101  16.438 
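
For context on why the plain mutate call in the question does not work: mutate hands the whole Secs column to nS in one call, and nS is not vectorized, so it returns a single number instead of one value per row. If rowwise() is not wanted, one possible workaround is to do the per-element call explicitly with vapply inside mutate. This is a minimal sketch on a smaller data set, not a speed-up; it still runs nS once per row.

```r
library(dplyr)

set.seed(123)
DF <- data.frame(Secs = cumsum(rexp(200, 1)))

# nS as defined in the question: works on a single timestamp at a time
nS <- function(Second, k = 5)
  max(1, nrow(DF %>% filter(Secs < Second & Secs >= Second - k)))

# vapply calls nS once per element, so mutate receives a full-length vector
out <- DF %>% mutate(ns = vapply(Secs, nS, numeric(1), k = 5))
head(out)
```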

Upvotes: 5
