Reputation: 8435
I have a data.table
, and a list of dates. I wish to filter and amend the rows using a function that checks to see if the dates against the list.
# example data
set.seed(1)
tt <- sample(
seq(as.POSIXct("2011-10-02"), as.POSIXct("2014-04-06"),
by = "day"), 10)
IR1 <- data.table(tstamp = sort(tt), dLoc = 1L:10L)
List of dates:
DLSlist <- lapply(
list(dls11t12 = c("2011-10-02", "2012-04-01"),
dls12t13 = c("2012-10-07", "2013-04-07"),
dls13t14 = c("2013-10-06", "2014-04-06"),
dls14t15 = c("2014-10-05", "2015-04-05"),
dls15t16 = c("2015-10-04", "2016-04-03"),
dls16t17 = c("2016-10-02", "2017-04-02")
),
function(X) as.POSIXct(X)
)
I would like to transform dLoc
if it falls inside any of the date ranges in DLSlist
. I can do it the long way as follows:
IR1[tstamp > DLSlist[[1]][1] & tstamp < DLSlist[[1]][2], tstamp := tstamp + 60*60]
IR1[tstamp > DLSlist[[2]][1] & tstamp < DLSlist[[2]][2], tstamp := tstamp + 60*60]
IR1[tstamp > DLSlist[[3]][1] & tstamp < DLSlist[[3]][2], tstamp := tstamp + 60*60]
However that seems error-prone: a function is suited to this task ... mine didn't work.
DLStest <- function(dd, DLSobj) {
any(sapply(DLSobj, function(X) dd %between% X))
}
I applied it with:
IR1[DLStest(tstamp, DLSlist), tstamp := tstamp + 60*60]
However it didn't work: all of the rows were transformed (not only the ones inside the ranges, as had been the case in my ugly hack code).
Is there some means of selecting rows using a function -- or some other means of selecting rows based upon multiple range checks?
Update (with thanks to Frank, who spotted the issue)
You can indeed filter with a function that returns a vector or booleans. The error was all with my initial function.
DLStest_old <- function(dd, DLSobj) {
any(sapply(DLSobj, function(X) dd %between% X))
}
sapply
returns an object who's class
is matrix
; any
checks to see if there are any true values in the entire matrix
. If there are any true values is evaluates to a single TRUE
. If not, it evaluates to a single FALSE
.
Using the test data:
(IR1[DLStest_old(tstamp, DLSlist), dLoc := dLoc + 1000L])
tstamp dLoc
1: 2011-11-27 01:00:00 1001
2: 2012-04-03 00:00:00 1002
3: 2012-06-01 00:00:00 1003
4: 2012-09-06 00:00:00 1004
5: 2013-03-09 01:00:00 1005
6: 2013-04-25 00:00:00 1006
7: 2013-05-25 00:00:00 1007
8: 2013-12-29 01:00:00 1008
9: 2014-01-09 01:00:00 1009
10: 2014-02-08 01:00:00 1010
The fix is to test separately for each row of the matrix, using apply
.
DLStest <- function(dd, DLSobj) {
apply(sapply(DLSobj, function(X) dd %between% X), 1, any)
}
This now works:
> (IR1[DLStest(tstamp, DLSlist), dLoc := dLoc + 1000L])
tstamp dLoc
1: 2011-11-27 01:00:00 1001
2: 2012-04-03 00:00:00 2
3: 2012-06-01 00:00:00 3
4: 2012-09-06 00:00:00 4
5: 2013-03-09 01:00:00 1005
6: 2013-04-25 00:00:00 6
7: 2013-05-25 00:00:00 7
8: 2013-12-29 01:00:00 1008
9: 2014-01-09 01:00:00 1009
10: 2014-02-08 01:00:00 1010
Upvotes: 0
Views: 229
Reputation: 66819
You want to subset with a logical vector. In your initial formulation, the function only returns a single value (instead of a vector), causing your assignment to affect all or none of the rows.
IR <- copy(IR1)
DLStest_old <- function(dd, DLSobj) {
any(sapply(DLSobj, function(X) dd %between% X))
}
# on the whole tstamp vector at once
IR[,DLStest_old(tstamp, DLSlist)]
# TRUE
One solution is to use your function, but apply it "by row":
# by row
IR[,DLStest_old(tstamp, DLSlist),by=1:nrow(IR)]$V1
# TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
Note that I'm putting this in the j
position of the data.table to return the result. Typically, to subset by an expression it can be put into the i
position (before the first comma), however "by" does not apply to i
expressions, so for this approach it's probably best to save the logical vector and then subset by it:
# by row, for use in i
change_em <- IR[,DLStest_old(tstamp, DLSlist),by=1:nrow(IR)]$V1
IR[change_em,tstamp:=tstamp+1e15][]
I busted your dates to make the changes more clear, resulting in:
tstamp dLoc
1: ))0'-06-03 15:45:52 1
2: 2012-04-03 00:00:00 2
3: 2012-06-01 00:00:00 3
4: 2012-09-07 00:00:00 4
5: ))0'-06-03 15:45:52 5
6: 2013-04-26 00:00:00 6
7: 2013-05-25 00:00:00 7
8: ))0'-06-03 15:45:52 8
9: ))0'-06-03 15:45:52 9
10: ))0'-06-03 15:45:52 10
Another solution that you found is to use something from the *apply
family:
DLStest_apply <- function(dd, DLSobj) {
apply(sapply(DLSobj, function(X) dd %between% X), 1, any)
}
# apply "any" on the margin of the sapply result
IR[,DLStest_apply(tstamp, DLSlist)]
# TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
apply
is made for matrices and arrays and the result of this sapply
is a matrix,
class(sapply(DLSlist, function(X) IR$tstamp %between% X))
# "matrix"
so this should be pretty fast. In general, sapply
can return different types of results.
P.S. I think dates are hard to read at a glance and it's best not to use them in your examples if you can tell ahead of time that you don't need them.
Upvotes: 1
Reputation: 12905
Your data looks like it doesn't have overlapping ranges in DLSlist, in which case this should work -
library(data.table)
#creating the data
DLSlist <- data.table(read.csv(textConnection('
"2011-10-02", "2012-04-01"
"2012-10-07", "2013-04-07"
"2013-10-06", "2014-04-06"
"2014-10-05", "2015-04-05"
"2015-10-04", "2016-04-03"
"2016-10-02", "2017-04-02"'), header = FALSE))
IR1 <- data.table(
tstamp = c("2011-10-01", "2012-10-06", "2014-10-07","2016-10-03")
)
#fixing data type
IR1[,tstamp := as.Date(tstamp,"%Y-%m-%d")]
DLSlist[,V1 := as.Date(V1,"%Y-%m-%d")]
DLSlist[,V2 := as.Date(V2,"%Y-%m-%d")]
DLSlist[,tstamp := V1]
#setting a key for data.table to find the closest match
setkey(IR1,tstamp)
setkey(DLSlist,tstamp)
#roll = Inf finds the closest match for the key
IR2 <- DLSlist[IR1, roll = Inf]
#Doing the operation where condition is satisfied
IR2[tstamp > V1 & tstamp < V2 , tstamp2 := tstamp + 60*60]
Output
> IR2
tstamp V1 V2 tstamp2
1: 2011-10-01 <NA> <NA> <NA>
2: 2012-10-06 2011-10-02 2012-04-01 <NA>
3: 2014-10-07 2014-10-05 2015-04-05 2024-08-15
4: 2016-10-03 2016-10-02 2017-04-02 2026-08-12
If you do have overlapping ranges then you create something like a set of all dates on which to perform this operation, and merge it back to IR1
to see which dates fall in this set. You can get a list of all the dates on which to perform this operation like this -
DLSlist2 <- unique(DLSlist[,list(DatesToFix = seq.Date(V1, V2, by = "day")), by = "V1"][,V1 := NULL])
I trust that you will be able to put this logic as a function.
Upvotes: 1