Reputation: 2288
I am analysing air traffic movements at an airport. My data set comprises aircraft block off times (leaving the gate) and the respective take-off times. I am looking for an efficient way to count the (cumulative) occurrence of take-off events based on a condition given by the block-time of a flight.
Being relatively new to R, I have managed to code this by
My solution is pretty slow already for a month of data (~ 50.000 flights), so it will be cumbersome to analyse larger time frames of one or two years.
I failed to find a similar problem on stackoverflow (or elsewhere) that applies to my problem. Neither could I make an apply()
or sapply()
work properly.
This is my code:
## count depeartures before own off-block
data$CUM_DEPS <- rep(NA, nrow(data)) # initialise column for dep count
for(i in 1:nrow(data)){ # loop over the data
data$CUM_DEPS[i] <- nrow(data[data$TAKE_OFF_TIME < data$BLOCK_OFF_TIME[i],])
}
Any pointers?
As suggested, this is a snapshot of the data and the result column i created.
FLTID TAKE_OFF_TIME BLOCK_OFF_TIME CUM_DEPS
Flight1 2013-07-01 05:02:42 2013-07-01 04:51:00 0
Flight2 2013-07-01 05:04:30 2013-07-01 04:53:52 0
Flight3 2013-07-01 05:09:01 2013-07-01 04:55:14 0
Flight4 2013-07-01 05:10:30 2013-07-01 05:00:57 0
Flight5 2013-07-01 05:12:58 2013-07-01 05:00:06 0
Flight6 2013-07-01 05:18:45 2013-07-01 05:04:14 1
Flight7 2013-07-01 05:22:12 2013-07-01 05:03:39 1
Flight8 2013-07-01 05:26:02 2013-07-01 05:09:32 3
Flight9 2013-07-01 05:27:24 2013-07-01 05:19:24 6
Flight10 2013-07-01 05:31:32 2013-07-01 05:17:05 5
Upvotes: 2
Views: 315
Reputation: 13122
You could check where, in-between "TAKE_OFF_TIME"s, each "BLOCK_OFF_TIME" falls. findInterval
is fast for this; the following looks valid, but maybe you'll have to check findInterval
's arguments to suit your exact problem.
findInterval(as.POSIXct(DF[["BLOCK_OFF_TIME"]]),
as.POSIXct(DF[["TAKE_OFF_TIME"]]))
#[1] 0 0 0 0 0 1 1 3 6 5
And, for the record, the loop using sapply
:
BOT = as.POSIXct(DF[["BLOCK_OFF_TIME"]])
TOT = as.POSIXct(DF[["TAKE_OFF_TIME"]])
sapply(BOT, function(x) sum(TOT < x))
#[1] 0 0 0 0 0 1 1 3 6 5
Where "DF":
DF = structure(list(FLTID = structure(c(1L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 2L), .Label = c("Flight1", "Flight10", "Flight2", "Flight3",
"Flight4", "Flight5", "Flight6", "Flight7", "Flight8", "Flight9"
), class = "factor"), TAKE_OFF_TIME = structure(1:10, .Label = c("2013-07-01 05:02:42",
"2013-07-01 05:04:30", "2013-07-01 05:09:01", "2013-07-01 05:10:30",
"2013-07-01 05:12:58", "2013-07-01 05:18:45", "2013-07-01 05:22:12",
"2013-07-01 05:26:02", "2013-07-01 05:27:24", "2013-07-01 05:31:32"
), class = "factor"), BLOCK_OFF_TIME = structure(c(1L, 2L, 3L,
5L, 4L, 7L, 6L, 8L, 10L, 9L), .Label = c("2013-07-01 04:51:00",
"2013-07-01 04:53:52", "2013-07-01 04:55:14", "2013-07-01 05:00:06",
"2013-07-01 05:00:57", "2013-07-01 05:03:39", "2013-07-01 05:04:14",
"2013-07-01 05:09:32", "2013-07-01 05:17:05", "2013-07-01 05:19:24"
), class = "factor"), CUM_DEPS = c(0L, 0L, 0L, 0L, 0L, 1L, 1L,
3L, 6L, 5L)), .Names = c("FLTID", "TAKE_OFF_TIME", "BLOCK_OFF_TIME",
"CUM_DEPS"), class = "data.frame", row.names = c(NA, -10L))
Upvotes: 1
Reputation: 721
From above code, it seems like you are doing one-to-many comparison.
The thing that makes your code slow is that you are subsetting data based on boolean index for every single step.
data$CUM_DEPS <- rep(NA, nrow(data))
take_off_time = data$TAKE_OFF_TIME
for(i in 1:nrow(data)){
data$CUM_DEPS[i] = sum(data$BLOCK_OFF_TIME[i] > take_off_time)
}
This small modification will make it much faster, although I cannot say with an exact number since I do not have a reproducible example.
The biggest difference is that I store date vector 'take_off_time' rather than calling from the dataframe for every single iteration, and not subsetting data based on boolean, but summing single boolean.
Above all is from the assumption that you have processed date correctly so that it can be compared with inequality.
Upvotes: 1