selecting data.table rows using a function

Question

I have a data.table and a list of dates. I wish to filter and amend the rows using a function that checks the dates against the list.

# example data
library(data.table)
set.seed(1)
tt <- sample(
             seq(as.POSIXct("2011-10-02"), as.POSIXct("2014-04-06"),
                 by = "day"), 10)
IR1 <- data.table(tstamp = sort(tt), dLoc = 1L:10L)

List of dates:

DLSlist <- lapply(
                  list(dls11t12 = c("2011-10-02", "2012-04-01"), 
                       dls12t13 = c("2012-10-07", "2013-04-07"), 
                       dls13t14 = c("2013-10-06", "2014-04-06"),
                       dls14t15 = c("2014-10-05", "2015-04-05"),
                       dls15t16 = c("2015-10-04", "2016-04-03"),
                       dls16t17 = c("2016-10-02", "2017-04-02")
                       ), 
                  function(X) as.POSIXct(X)
                  )

I would like to transform tstamp if it falls inside any of the date ranges in DLSlist. I can do it the long way as follows:

IR1[tstamp > DLSlist[[1]][1] & tstamp < DLSlist[[1]][2], tstamp := tstamp + 60*60]
IR1[tstamp > DLSlist[[2]][1] & tstamp < DLSlist[[2]][2], tstamp := tstamp + 60*60]
IR1[tstamp > DLSlist[[3]][1] & tstamp < DLSlist[[3]][2], tstamp := tstamp + 60*60]

However, that seems error-prone. A function seems better suited to this task, but mine didn't work.

DLStest <- function(dd, DLSobj) {
    any(sapply(DLSobj, function(X) dd %between% X))
}

I applied it with:

IR1[DLStest(tstamp, DLSlist), tstamp := tstamp + 60*60]

However it didn't work: all of the rows were transformed (not only the ones inside the ranges, as had been the case in my ugly hack code).

Is there some means of selecting rows using a function -- or some other means of selecting rows based upon multiple range checks?

Update (with thanks to Frank, who spotted the issue)

You can indeed filter with a function that returns a vector of booleans. The error was entirely in my initial function.

DLStest_old <- function(dd, DLSobj) {
    any(sapply(DLSobj, function(X) dd %between% X))
}

sapply returns an object whose class is matrix; any checks whether there are any true values in the entire matrix. If there are any true values, it evaluates to a single TRUE. If not, it evaluates to a single FALSE.
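To make the collapse concrete, here is a minimal base R sketch. It is only an illustration: the numeric vector x and the list ranges are made-up stand-ins for the timestamps and DLSlist, and plain comparisons stand in for %between% so that it runs without data.table.

```r
# Hypothetical stand-ins: small numbers instead of timestamps.
x <- c(1, 5, 12, 20)
ranges <- list(a = c(0, 2), b = c(10, 15))

# One row per element of x, one column per range.
m <- sapply(ranges, function(r) x >= r[1] & x <= r[2])

collapsed <- any(m)            # a single logical for the whole matrix
per_row   <- apply(m, 1, any)  # one logical per element of x

collapsed  # TRUE
per_row    # TRUE FALSE TRUE FALSE
```

A length-one TRUE used as a row filter selects every row, which is why the whole table was changed.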

Using the test data:

(IR1[DLStest_old(tstamp, DLSlist), dLoc := dLoc + 1000L])

                 tstamp dLoc
 1: 2011-11-27 01:00:00 1001
 2: 2012-04-03 00:00:00 1002
 3: 2012-06-01 00:00:00 1003
 4: 2012-09-06 00:00:00 1004
 5: 2013-03-09 01:00:00 1005
 6: 2013-04-25 00:00:00 1006
 7: 2013-05-25 00:00:00 1007
 8: 2013-12-29 01:00:00 1008
 9: 2014-01-09 01:00:00 1009
10: 2014-02-08 01:00:00 1010

The fix is to test separately for each row of the matrix, using apply.

DLStest <- function(dd, DLSobj) {
    apply(sapply(DLSobj, function(X) dd %between% X), 1, any)
}

This now works:

(IR1[DLStest(tstamp, DLSlist), dLoc := dLoc + 1000L])
                 tstamp dLoc
 1: 2011-11-27 01:00:00 1001
 2: 2012-04-03 00:00:00    2
 3: 2012-06-01 00:00:00    3
 4: 2012-09-06 00:00:00    4
 5: 2013-03-09 01:00:00 1005
 6: 2013-04-25 00:00:00    6
 7: 2013-05-25 00:00:00    7
 8: 2013-12-29 01:00:00 1008
 9: 2014-01-09 01:00:00 1009
10: 2014-02-08 01:00:00 1010

Frank · Accepted Answer

You want to subset with a logical vector. In your initial formulation, the function only returns a single value (instead of a vector), causing your assignment to affect all or none of the rows.

IR <- copy(IR1)
DLStest_old <- function(dd, DLSobj) {
    any(sapply(DLSobj, function(X) dd %between% X))
}

# on the whole tstamp vector at once
  IR[,DLStest_old(tstamp, DLSlist)]
  # TRUE

One solution is to use your function, but apply it "by row":

# by row
  IR[,DLStest_old(tstamp, DLSlist),by=1:nrow(IR)]$V1
  # TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

Note that I'm putting this in the j position of the data.table to return the result. Typically, to subset by an expression it can be put into the i position (before the first comma), however "by" does not apply to i expressions, so for this approach it's probably best to save the logical vector and then subset by it:

 # by row, for use in i
    change_em <- IR[,DLStest_old(tstamp, DLSlist),by=1:nrow(IR)]$V1
    IR[change_em,tstamp:=tstamp+1e15][]

I busted your dates to make the changes more clear, resulting in:

                 tstamp dLoc
 1: ))0'-06-03 15:45:52    1
 2: 2012-04-03 00:00:00    2
 3: 2012-06-01 00:00:00    3
 4: 2012-09-07 00:00:00    4
 5: ))0'-06-03 15:45:52    5
 6: 2013-04-26 00:00:00    6
 7: 2013-05-25 00:00:00    7
 8: ))0'-06-03 15:45:52    8
 9: ))0'-06-03 15:45:52    9
10: ))0'-06-03 15:45:52   10

Another solution that you found is to use something from the *apply family:

DLStest_apply <- function(dd, DLSobj) {
    apply(sapply(DLSobj, function(X) dd %between% X), 1, any)
}

# apply "any" on the margin of the sapply result
  IR[,DLStest_apply(tstamp, DLSlist)]
  # TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

apply is made for matrices and arrays and the result of this sapply is a matrix,

class(sapply(DLSlist, function(X) IR$tstamp %between% X))
# "matrix"   (R >= 4.0.0 prints "matrix" "array")

so this should be pretty fast. In general, though, sapply can return a vector, a matrix, or a list, depending on the lengths of the individual results.
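One case where this matters: if the input has length one, sapply simplifies to a named vector rather than a matrix, and apply(..., 1, any) then fails. The sketch below is one possible way around that, not the approach from the question: it uses lapply, which always returns a list, and ORs the pieces together with Reduce. The function name in_any_range and the numeric test data are hypothetical, and the bounds are inclusive to mirror the default of %between%.

```r
# Hypothetical helper: TRUE where x falls inside at least one range.
in_any_range <- function(x, ranges) {
    # lapply always returns a list: one logical vector per range
    hits <- lapply(ranges, function(r) x >= r[1] & x <= r[2])
    # element-wise OR across all ranges
    Reduce(`|`, hits)
}

x <- c(1, 5, 12, 20)
ranges <- list(a = c(0, 2), b = c(10, 15))

in_any_range(x, ranges)     # TRUE FALSE TRUE FALSE
in_any_range(x[1], ranges)  # TRUE

# The shape of the sapply result depends on the input length:
is.matrix(sapply(ranges, function(r) x >= r[1] & x <= r[2]))        # TRUE
is.matrix(sapply(ranges, function(r) x[1] >= r[1] & x[1] <= r[2]))  # FALSE
```

The result is an ordinary logical vector of the same length as x, so it can be used for subsetting in the same way as the output of DLStest_apply.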

P.S. I think dates are hard to read at a glance and it's best not to use them in your examples if you can tell ahead of time that you don't need them.
