extracting number of items in date range for each row in huge data frame in R

Question

I was wondering if anyone had some suggestions for efficient function to be applied to a big data frame (2m rows, 50 variables) to pull out number of items that lie within a date range specified in the row. For example my data frame looks something like this:

row.name   start_time          end_time              jid     hostname
1930525    2016-01-21 15:41:35 2016-01-21 15:47:40   1976235 hpc1node10
1930526    2016-01-21 15:40:50 2016-01-21 15:47:44   1976230 hpc1node08
1930527    2016-01-21 15:37:20 2016-01-21 15:47:46   1976192 hpc1node16
1930528    2016-01-21 15:43:05 2016-01-21 15:47:53   1976533 hpc1node13
1930529    2016-01-21 15:37:35 2016-01-21 15:47:54   1976197 hpc1node16

I want to count for each row (job) how many other jobs have been running at the same time. So which jobs finished in between start_time and end_time and jobs started in between start_time and end_time.

I tried coming up with a function to use with by, but it's woefully slow. Yet I can't think of a better way. My stab:

get_range <- function(r, d=NULL) {

  # Attempt at reducing the number of items by only 
  #   looking at 500 either side for the same hostname
  dd = subset(d, jobnumber>r$jobnumber-500&jobnumber=r$start_time & dd$end_time <= r$end_time) | 
    (dd$start_time >= r$start_time & dd$start_time <= r$end_time)
    )

  running_jobs
}

then running

by(d, get_range, d=d)

Any ideas?

extracting number of items in date range for each row in huge data frame in R

Answers (1)

Related Questions