Ray
Ray

Reputation: 2288

Counting occurence based on condition for each element of a column

I am analysing air traffic movements at an airport. My data set comprises aircraft block off times (leaving the gate) and the respective take-off times. I am looking for an efficient way to count the (cumulative) occurrence of take-off events based on a condition given by the block-time of a flight.

Being relatively new to R, I have managed to code this by

  1. looping over all rows of my data;
  2. subsetting the data for the block time (condition event) in that row; and
  3. counting the number of rows for the (temporary) data frame.

My solution is pretty slow already for a month of data (~ 50.000 flights), so it will be cumbersome to analyse larger time frames of one or two years. I failed to find a similar problem on stackoverflow (or elsewhere) that applies to my problem. Neither could I make an apply() or sapply() work properly.

This is my code:

## count depeartures before own off-block
data$CUM_DEPS <- rep(NA, nrow(data))      # initialise column for dep count

for(i in 1:nrow(data)){                   # loop over the data
    data$CUM_DEPS[i] <- nrow(data[data$TAKE_OFF_TIME < data$BLOCK_OFF_TIME[i],])
} 

Any pointers?

As suggested, this is a snapshot of the data and the result column i created.

    FLTID        TAKE_OFF_TIME       BLOCK_OFF_TIME    CUM_DEPS
Flight1  2013-07-01 05:02:42  2013-07-01 04:51:00    0
Flight2  2013-07-01 05:04:30  2013-07-01 04:53:52    0
Flight3  2013-07-01 05:09:01  2013-07-01 04:55:14    0
Flight4  2013-07-01 05:10:30  2013-07-01 05:00:57    0
Flight5  2013-07-01 05:12:58  2013-07-01 05:00:06    0
Flight6  2013-07-01 05:18:45  2013-07-01 05:04:14    1
Flight7  2013-07-01 05:22:12  2013-07-01 05:03:39    1
Flight8  2013-07-01 05:26:02  2013-07-01 05:09:32    3
Flight9  2013-07-01 05:27:24  2013-07-01 05:19:24    6
Flight10 2013-07-01 05:31:32  2013-07-01 05:17:05    5

Upvotes: 2

Views: 315

Answers (2)

alexis_laz
alexis_laz

Reputation: 13122

You could check where, in-between "TAKE_OFF_TIME"s, each "BLOCK_OFF_TIME" falls. findInterval is fast for this; the following looks valid, but maybe you'll have to check findInterval's arguments to suit your exact problem.

findInterval(as.POSIXct(DF[["BLOCK_OFF_TIME"]]), 
             as.POSIXct(DF[["TAKE_OFF_TIME"]]))
#[1] 0 0 0 0 0 1 1 3 6 5

And, for the record, the loop using sapply:

BOT = as.POSIXct(DF[["BLOCK_OFF_TIME"]])
TOT = as.POSIXct(DF[["TAKE_OFF_TIME"]])
sapply(BOT, function(x) sum(TOT < x))
#[1] 0 0 0 0 0 1 1 3 6 5

Where "DF":

DF = structure(list(FLTID = structure(c(1L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L, 10L, 2L), .Label = c("Flight1", "Flight10", "Flight2", "Flight3", 
"Flight4", "Flight5", "Flight6", "Flight7", "Flight8", "Flight9"
), class = "factor"), TAKE_OFF_TIME = structure(1:10, .Label = c("2013-07-01 05:02:42", 
"2013-07-01 05:04:30", "2013-07-01 05:09:01", "2013-07-01 05:10:30", 
"2013-07-01 05:12:58", "2013-07-01 05:18:45", "2013-07-01 05:22:12", 
"2013-07-01 05:26:02", "2013-07-01 05:27:24", "2013-07-01 05:31:32"
), class = "factor"), BLOCK_OFF_TIME = structure(c(1L, 2L, 3L, 
5L, 4L, 7L, 6L, 8L, 10L, 9L), .Label = c("2013-07-01 04:51:00", 
"2013-07-01 04:53:52", "2013-07-01 04:55:14", "2013-07-01 05:00:06", 
"2013-07-01 05:00:57", "2013-07-01 05:03:39", "2013-07-01 05:04:14", 
"2013-07-01 05:09:32", "2013-07-01 05:17:05", "2013-07-01 05:19:24"
), class = "factor"), CUM_DEPS = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 
3L, 6L, 5L)), .Names = c("FLTID", "TAKE_OFF_TIME", "BLOCK_OFF_TIME", 
"CUM_DEPS"), class = "data.frame", row.names = c(NA, -10L))

Upvotes: 1

won782
won782

Reputation: 721

From above code, it seems like you are doing one-to-many comparison.

The thing that makes your code slow is that you are subsetting data based on boolean index for every single step.

data$CUM_DEPS <- rep(NA, nrow(data)) 
take_off_time = data$TAKE_OFF_TIME

for(i in 1:nrow(data)){
    data$CUM_DEPS[i] = sum(data$BLOCK_OFF_TIME[i] > take_off_time)
} 

This small modification will make it much faster, although I cannot say with an exact number since I do not have a reproducible example.

The biggest difference is that I store date vector 'take_off_time' rather than calling from the dataframe for every single iteration, and not subsetting data based on boolean, but summing single boolean.

Above all is from the assumption that you have processed date correctly so that it can be compared with inequality.

Upvotes: 1

Related Questions