Reputation: 1345
I was wondering if anyone had some suggestions for efficient function to be applied to a big data frame (2m rows, 50 variables) to pull out number of items that lie within a date range specified in the row. For example my data frame looks something like this:
row.name start_time end_time jid hostname
1930525 2016-01-21 15:41:35 2016-01-21 15:47:40 1976235 hpc1node10
1930526 2016-01-21 15:40:50 2016-01-21 15:47:44 1976230 hpc1node08
1930527 2016-01-21 15:37:20 2016-01-21 15:47:46 1976192 hpc1node16
1930528 2016-01-21 15:43:05 2016-01-21 15:47:53 1976533 hpc1node13
1930529 2016-01-21 15:37:35 2016-01-21 15:47:54 1976197 hpc1node16
I want to count for each row (job) how many other jobs have been running at the same time. So which jobs finished in between start_time
and end_time
and jobs started in between start_time
and end_time
.
I tried coming up with a function to use with by
, but it's woefully slow. Yet I can't think of a better way. My stab:
get_range <- function(r, d=NULL) {
# Attempt at reducing the number of items by only
# looking at 500 either side for the same hostname
dd = subset(d, jobnumber>r$jobnumber-500&jobnumber<r$jobnumber+500&hostname==r$hostname)
running_jobs = sum(
(dd$end_time>=r$start_time & dd$end_time <= r$end_time) |
(dd$start_time >= r$start_time & dd$start_time <= r$end_time)
)
running_jobs
}
then running
by(d, get_range, d=d)
Any ideas?
Upvotes: 0
Views: 105
Reputation: 83215
You can achieve this efficiently using the foverlaps
function from the data.table package which is specifically designed for large datasets:
library(data.table)
# converting to a 'data.table'
# and setting the keys to the time columns
# and adding an index column
setDT(d, key = c("hostname","start_time","end_time"))[, xid := .I]
# checking for overlaps & counting the number of overlaps
tmp <- foverlaps(d, d, which = TRUE)[, .N-1, xid]
# adding the count (N := V1) to 'd' by joining with 'tmp' on 'xid'
d[tmp, N := V1, on="xid"][, xid := NULL]
which gives:
> d
row.name start_time end_time jid hostname N
1: 1930526 2016-01-21 15:40:50 2016-01-21 15:47:44 1976230 hpc1node08 0
2: 1930525 2016-01-21 15:41:35 2016-01-21 15:47:40 1976235 hpc1node10 0
3: 1930528 2016-01-21 15:43:05 2016-01-21 15:47:53 1976533 hpc1node13 0
4: 1930527 2016-01-21 15:37:20 2016-01-21 15:47:46 1976192 hpc1node16 1
5: 1930529 2016-01-21 15:37:35 2016-01-21 15:47:54 1976197 hpc1node16 1
In the above solution I have used type = "any"
(this is not in the code above because this is the default) to look for other rows that have a start_time
and end_time
that overlap with the start_time
and end_time
of other rows within the hostname
groups. Other possible values for type
are: within
, start
& end
.
Furthermore, I used .N-1
to take account for 'self'-overlaps.
Upvotes: 2