Reputation: 493
I have one data table that contains just a sequence of times. I have another data table with two columns: start_times and end_times. I want to add a column to the first data table whose value is the number of rows in the second data table for which the time from the first table falls between the start and end times. Here is my code:
start_date <- as.POSIXct(x = "2017-01-31 17:00:00", format = "%Y-%m-%d %H:%M:%S")
end_date <- as.POSIXct(x = "2017-02-01 09:00:00", format = "%Y-%m-%d %H:%M:%S")
all_dates <- as.data.table(seq(start_date, end_date, "min"))
colnames(all_dates) <- c("Bin")
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- data.table(start_times, end_times)
Here is what I want to do, but it is wrong and gives an error. What's the right way to write this?
all_dates[, BinCount := input_data[start_times < Bin & end_times > Bin, .N] ]
In the end I should get something like:
Bin BinCount
2017-01-31 17:00:00 1
2017-01-31 17:01:00 5
...
Upvotes: 2
Views: 88
Reputation: 15062
Wasn't requested, but here is a compact alternative solution using the tidyverse. It uses lubridate parsers, interval, and %within%, as well as purrr::map_int, to generate the desired bin counts.
library(tidyverse)
library(lubridate)
start_date <- ymd_hms(x = "2017-01-31 17:00:00") # lubridate parsers
end_date <- ymd_hms(x = "2017-02-01 09:00:00")
all_dates <- tibble(seq(start_date, end_date, "min")) # tibble swap for data.table
colnames(all_dates) <- c("Bin")
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- tibble(
start_times,
end_times,
intvl = interval(start_times, end_times) # Add interval column
)
all_dates %>% # Checks date in Bin and counts intervals it lies within
mutate(BinCount = map_int(.$Bin, ~ sum(. %within% input_data$intvl)))
# A tibble: 961 x 2
Bin BinCount
<dttm> <int>
1 2017-01-31 17:00:00 0
2 2017-01-31 17:01:00 0
3 2017-01-31 17:02:00 0
4 2017-01-31 17:03:00 0
5 2017-01-31 17:04:00 0
6 2017-01-31 17:05:00 0
7 2017-01-31 17:06:00 0
8 2017-01-31 17:07:00 1
9 2017-01-31 17:08:00 1
10 2017-01-31 17:09:00 1
# ... with 951 more rows
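One caveat worth checking: %within% treats the interval endpoints as inclusive, whereas the condition in the question (start_times < Bin & end_times > Bin) is strict, so the counts can differ for bins that fall exactly on a start or end time. A minimal sketch of the difference, using a made-up five-minute interval rather than the OP's data (the names bin, closed_hit and open_hit are only for illustration):

```r
library(lubridate)

# Hypothetical single interval, used only to illustrate boundary handling
bin   <- ymd_hms("2017-01-31 17:00:00")
intvl <- interval(bin, bin + minutes(5))

# %within% is closed on both ends, so a time equal to the start counts as inside
closed_hit <- bin %within% intvl

# The strict comparison from the question excludes the endpoints
open_hit <- int_start(intvl) < bin & bin < int_end(intvl)
```

If the strict behaviour is what you need, a comparison like the second one could replace the %within% call inside map_int.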
Upvotes: 1
Reputation: 26248
In data.table you're after a range join (a non-equi join).
library(data.table)
start_date <- as.POSIXct(x = "2017-01-31 17:00:00", format = "%Y-%m-%d %H:%M:%S")
end_date <- as.POSIXct(x = "2017-02-01 09:00:00", format = "%Y-%m-%d %H:%M:%S")
all_dates <- as.data.table(seq(start_date, end_date, "min"))
colnames(all_dates) <- c("Bin")
set.seed(123)
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- data.table(start_times, end_times)
## doing the range-join and calculating the number of items per bin in one chained step
input_data[
all_dates
, on = .(start_times < Bin, end_times > Bin)
, nomatch = 0
, allow.cartesian = TRUE
][, .N, by = start_times]
# start_times N
# 1: 2017-01-31 17:01:00 1
# 2: 2017-01-31 17:02:00 1
# 3: 2017-01-31 17:03:00 1
# 4: 2017-01-31 17:04:00 1
# 5: 2017-01-31 17:05:00 1
# ---
# 956: 2017-02-01 08:56:00 6
# 957: 2017-02-01 08:57:00 4
# 958: 2017-02-01 08:58:00 4
# 959: 2017-02-01 08:59:00 5
# 960: 2017-02-01 09:00:00 5
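Because of nomatch = 0, bins that fall inside no interval are dropped from the result. If every bin should be kept and the count stored as a BinCount column, as in the question, one possible variant is by = .EACHI, which evaluates .N once per row of all_dates and gives 0 where nothing matches. A small sketch on hand-made data (not the sampled data above, and the intermediate name counts is only for illustration), so the counts can be checked by eye:

```r
library(data.table)

# Hand-made data: five one-minute bins and two intervals,
# (17:00, 17:03) and (17:01, 17:03)
bins <- as.POSIXct("2017-01-31 17:00:00", tz = "UTC") + 60 * 0:4
all_dates  <- data.table(Bin = bins)
input_data <- data.table(start_times = bins[c(1, 2)],
                         end_times   = bins[c(4, 4)])

# Non-equi join evaluated once per row of all_dates; rows come back in
# the order of all_dates, and .N is 0 for bins with no matching interval
counts <- input_data[all_dates,
                     on = .(start_times < Bin, end_times > Bin),
                     .N,
                     by = .EACHI]

# Write the per-bin counts back by reference
all_dates[, BinCount := counts$N]
```

Only the 17:01 and 17:02 bins lie strictly inside an interval here; the other three bins are kept with a count of zero.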
Note:
The all_dates object is on the right-hand side of the join, so the result contains the names of the input_data columns, even though they are your Bins (see this issue for the discussion on this topic).
I've used set.seed(), as you're taking samples.
Upvotes: 2
Reputation: 20085
The problem can be solved easily with sqldf, as it provides an easy way to join tables with range checking. One solution is shown below.
The data from the OP:
library(data.table)
start_date <- as.POSIXct(x = "2017-01-31 17:00:00", format = "%Y-%m-%d %H:%M:%S")
end_date <- as.POSIXct(x = "2017-02-01 09:00:00", format = "%Y-%m-%d %H:%M:%S")
all_dates <- as.data.table(seq(start_date, end_date, "min"))
colnames(all_dates) <- c("Bin")
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- data.table(start_times, end_times)
library(sqldf)
result <- sqldf("SELECT all_dates.bin, COUNT(*) AS BinCount
FROM all_dates, input_data
WHERE all_dates.bin > input_data.start_times AND
all_dates.bin < input_data.end_times
GROUP BY bin" )
result
Bin BinCount
1 2017-01-31 17:01:00 1
2 2017-01-31 17:02:00 1
3 2017-01-31 17:03:00 1
4 2017-01-31 17:04:00 1
5 2017-01-31 17:05:00 1
6 2017-01-31 17:06:00 1
...........
...........
497 2017-02-01 01:17:00 6
498 2017-02-01 01:18:00 5
499 2017-02-01 01:19:00 5
500 2017-02-01 01:20:00 4
[ reached getOption("max.print") -- omitted 460 rows ]
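The query above is an inner join, so bins that fall inside no interval do not appear in the result. If those bins should be reported with a count of zero, a LEFT JOIN is one option. A sketch on hand-made data (not the OP's random sample), assuming sqldf's default SQLite backend:

```r
library(sqldf)

# Hand-made data: five one-minute bins and two intervals,
# (17:00, 17:03) and (17:01, 17:03)
bins <- as.POSIXct("2017-01-31 17:00:00", tz = "UTC") + 60 * 0:4
all_dates  <- data.frame(Bin = bins)
input_data <- data.frame(start_times = bins[c(1, 2)],
                         end_times   = bins[c(4, 4)])

# LEFT JOIN keeps every bin; COUNT(column) skips the NULLs that the
# join produces for bins without a matching interval
result <- sqldf("SELECT all_dates.Bin, COUNT(input_data.start_times) AS BinCount
                 FROM all_dates
                 LEFT JOIN input_data
                   ON all_dates.Bin > input_data.start_times
                  AND all_dates.Bin < input_data.end_times
                 GROUP BY all_dates.Bin
                 ORDER BY all_dates.Bin")
```

Counting input_data.start_times rather than using COUNT(*) matters here, because COUNT(*) would report 1 for an unmatched bin.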
Upvotes: 3