heatherr
heatherr

Reputation: 143

Select rows by date within interval in data.table

I would like to select the observations in a data table that fall within intervals of time specified in a second data table - the intervals are periods of time when observations are made from 2 platforms concurrently.

The first data table looks something like this. It is a bunch of animal sightings.

obs = data.table(sighting = as.POSIXct(c("2018-08-12 16:30:00", "2018-08-12 16:35:00", "2018-08-12 16:38:00", "2107-08-13 15:13:00", "2107-08-13 16:13:00", "2017-08-14 11:12:13"), format = "%Y-%m-%d %H:%M:%OS", tz = "America/Halifax"), encounter = c("1", "1", "1", "2", "3", "4"), what = c("frog", "frog", "toad", "bird", "goat","bird"))

Observations were made from 2 platforms.

platformA = data.table(station = "A", on.effort = as.POSIXct(c("2018-08-12 16:00:00", "2018-08-12 17:35:00","2017-08-14 11:00:13", "2018-08-15 17:35:00"), format = "%Y-%m-%d %H:%M:%OS", tz = "America/Halifax"), off.effort = as.POSIXct(c("2018-08-12 16:36:00", "2018-08-12 18:35:00","2017-08-14 12:12:13", "2018-08-15 18:35:00"), format = "%Y-%m-%d %H:%M:%OS", tz = "America/Halifax"))

platformB = data.table(station = "B", on.effort = as.POSIXct(c("2018-08-12 16:15:00", "2018-08-12 17:40:00", "2018-08-13 17:40:00","2017-08-14 11:05:13"), format = "%Y-%m-%d %H:%M:%OS", tz = "America/Halifax"), off.effort = as.POSIXct(c("2018-08-12 16:40:00", "2018-08-13 17:45:00", "2018-08-12 18:20:00","2017-08-14 12:30:13"), format = "%Y-%m-%d %H:%M:%OS", tz = "America/Halifax"))

I first calculated intervals for each platform, and then intersected the intervals to find out when observations were made concurrently.

setkey(platformA, on.effort, off.effort)
setkey(platformB, on.effort, off.effort)

common = foverlaps(platformA, platformB,type="any",nomatch=0)

common$x = intersect(interval(common$on.effort, common$off.effort), 
                     interval(common$i.on.effort, common$i.off.effort))

I'd like to end up with a table that is a subset of "obs" and contains only the rows covered by the intervals in "common$x". I had hoped to use foverlaps to find the rows in the intersected intervals, and created "point" intervals for my sightings with

obs[, sighting2 := sighting]

But foverlaps wants the "start" and "end" of each interval in separate columns, which is not how the intervals are stored in common$x.

I would like my output to look like this

           sighting encounter what
2018-08-12 16:30:00         1 frog
2018-08-12 16:35:00         1 frog
2017-08-14 11:12:13         4 bird

I'd appreciate any tips. Perhaps I could have been more efficient earlier on? Thanks.

Upvotes: 1

Views: 429

Answers (2)

Luke C
Luke C

Reputation: 10301

I think this should work even if you have varying observation numbers between the platforms. Using your obs, platformA, and platformB data as above, make intervals for both platforms more or less as you did above in common:

common = intersect(interval(platformA$on.effort, platformA$off.effort), 
                   interval(platformB$on.effort, platformB$off.effort))

You should be able to use %within% to check if there are any cases where the sighting falls within the common interval:

obs$both.seen <- sapply(obs$sighting, function(s){
  any(s %within% common)
})

OR

obs[, both.seen := sapply(sighting, function(x) any(x %within% common))]

New obs:

> obs
              sighting encounter what both.seen
1: 2018-08-12 16:30:00         1 frog      TRUE
2: 2018-08-12 16:35:00         1 frog      TRUE
3: 2018-08-12 16:38:00         1 toad     FALSE
4: 2107-08-13 15:13:00         2 bird     FALSE
5: 2107-08-13 16:13:00         3 goat     FALSE
6: 2017-08-14 11:12:13         4 bird      TRUE

Subset to get to your desired output:

obs <- obs[both.seen == 1][, both.seen := NULL][]

> obs
              sighting encounter what
1: 2018-08-12 16:30:00         1 frog
2: 2018-08-12 16:35:00         1 frog
3: 2017-08-14 11:12:13         4 bird

Upvotes: 1

tblznbits
tblznbits

Reputation: 6778

I believe this gets you what you want. It doesn't take advantage of data.table functions, and runs entirely on base R. I'm not sure if this will lead to performance issues in your data, but perhaps it gives a way to think about more data.table-esque functions.

library(data.table)

# Set up the data
obs = data.table(sighting = as.POSIXct(c("2018-08-12 16:30:00", 
                                         "2018-08-12 16:35:00", 
                                         "2018-08-12 16:38:00", 
                                         "2107-08-13 15:13:00", 
                                         "2107-08-13 16:13:00", 
                                         "2017-08-14 11:12:13"), 
                                       format = "%Y-%m-%d %H:%M:%OS", 
                                       tz = "America/Halifax"), 
                 encounter = c("1", "1", "1", "2", "3", "4"), 
                 what = c("frog", "frog", "toad", "bird", "goat","bird"))

platformA = data.table(station = "A", 
                       on.effort = as.POSIXct(c("2018-08-12 16:00:00", 
                                                "2018-08-12 17:35:00",
                                                "2017-08-14 11:00:13"), 
                                              format = "%Y-%m-%d %H:%M:%OS", 
                                              tz = "America/Halifax"), 
                       off.effort = as.POSIXct(c("2018-08-12 16:36:00", 
                                                 "2018-08-12 18:35:00",
                                                 "2017-08-14 12:12:13"), 
                                               format = "%Y-%m-%d %H:%M:%OS", 
                                               tz = "America/Halifax"))

platformB = data.table(station = "B", 
                       on.effort = as.POSIXct(c("2018-08-12 16:15:00", 
                                                "2018-08-12 17:40:00",
                                                "2017-08-14 11:05:13"), 
                                              format = "%Y-%m-%d %H:%M:%OS", 
                                              tz = "America/Halifax"), 
                       off.effort = as.POSIXct(c("2018-08-12 16:40:00", 
                                                 "2018-08-12 18:20:00",
                                                 "2017-08-14 12:30:13"), 
                                               format = "%Y-%m-%d %H:%M:%OS", 
                                               tz = "America/Halifax"))

# Get the start and end times for each observation (note use of pmax and pmin)
starts = pmax(platformA$on.effort, platformB$on.effort)
ends = pmin(platformA$off.effort, platformB$off.effort)

# For each sighting in obs check if it falls in between any of the intervals
seen = sapply(obs$sighting, function(x) {    
    any(x >= starts & x <= ends)
})

# Subset the data
obs[seen, ]

              sighting encounter what
1: 2018-08-12 16:30:00         1 frog
2: 2018-08-12 16:35:00         1 frog
3: 2017-08-14 11:12:13         4 bird

The primary aspect of this solution is the assignment of start and end. Since we're looking for the intersection of time of observation on two platforms, our start time is the later time of the two (i.e. the max) and our end time in the earliest time of the two platforms (i.e. the min). By using pmin and pmax, we can take get the min and max, respectively, element wise for the vectors of time. When making the comparison in x >= start & x <= min, a single time x is compared element wise to a pair of times start[i] and end[i] thereby giving us our interval of comparison.

Upvotes: 0

Related Questions