Reputation: 13
I'm fairly new in R and need some help. I have two dataframes with rather similar information. The first dataframe has information about misconnections for an airline, whereas the other one is the entire timetable for the same airline. Now, what I need is to make a new column in the misconnection data.frame including flights from the timetable that can replace the delayed flights on the transit.
The flights that I want to replace need to meet a range of conditions (within a certain time-horizon, needs to be the same weekday and it needs to fly to the same destination). I addition, I want R to choose the flight that is closest (by time) to the new arrival time at a transit(from the misconnection data.frame).
The misconnection data.frame looks like the following (1620 lines in total):
miscon <- data.frame(flight.date = as.Date(c("2019-08-05", "2019-10-03", "2019-07-21", "2019-05-29"), format="%Y-%m-%d"),
Outbound.airport = c("MXP", "KRK", "KLU", "OTP"),
arr.time = as.POSIXct(c("19:25:00", "20:52:00", "07:33:00", "18:49:00"), format="%H:%M:%S"),
next.pos.dep = as.POSIXct(c("19:36:00", "21:17:00", "07:58:00", "19:14:00"), format="%H:%M:%S"),
weekday = c("4", "7", "7", "3"))
view(miscon)
flight.date Outbound.airport arr.time next.pos.dep Weekday
1 2019-08-05 MXP 19:25:00 19:36:00 4
2 2019-10-03 KRK 20:52:00 21:17:00 7
3 2019-07-21 KLU 07:33:00 07:58:00 7
4 2019-05-29 OTP 18:49:00 19:14:00 3
And the timetable data.frame would look like this:
tt <- data.frame(start.date = as.Date(c("2019-03-25", "2019-05-02", "2019-07-30", "2019-05-29"), format="%Y-%m-%d"),
end.date = as.Date(c("2019-10-21", "2019-10-27", "2019-08-26", "2019-06-01"), format="%Y-%m-%d"),
weekday = c("1234567", "1.3..67", "1.34567", "..3.5.."),
Outbound.airport = c("KLU", "KLU", "MXP", "OTP"),
dep.time = as.POSIXct(c("12:20:00", "15:55:00", "19:55:00", "20:34:00"), format="%H:%M:%S"))
view(tt)
start.date end.date Weekday Outbound.airport dep.time
1 2019-03-25 2019-10-21 1234567 KLU 12:20:00
2 2019-05-02 2019-10-27 1.3..67 KLU 15:55:00
3 2019-07-30 2019-08-26 1.34567 MXP 19:55:00
4 2019-03-30 2019-06-01 ..3.5.. OTP 20:34:00
In Excel, this problem is solved using Index matching, which I've managed. However, the problem is slightly to big for excel to handle which is why I need to convert this to R. Did try with the match and mutate function in R, but seems like the values I'm matching must be equal - which I do not expect mine to be.
Also found an interesting solution to a similar problem using the DescTools package, which I tried to implemt with no success.
get_close2 <- function(xx=tt, yy=miscon) {
pos <- vector(mode = "numeric")
for(i in 1:dim(yy)[1]) {
pos[i] <- DescTools::Closest(xx$dep.time, yy$next.pos.dep[i])
#print(pos[i])
yy$new.flight[i] <- pos[i]
}
out <- yy
return(out)
}
get_close2()
For this one, I tried with only one condition. It generated a column, but with NA's only. Obviously, I am far out right now, which is why I'm reaching out for help. Hope the problem was clear. The end result would preferrably look something like the following:
miscon
flight.date Outbound.airport arr.time next.pos.dep Weekday new.flight.time
1 2019-12-05 MXP 19:25:00 19:36:00 4 19:55:00
2 2019-10-03 KRK 20:52:00 21:17:00 7 NA
3 2019-07-21 KLU 07:33:00 07:58:00 7 12:20:00
4 2019-05-29 OTP 18:49:00 19:14:00 3 20:34:00
Upvotes: 1
Views: 138
Reputation: 1999
I think you can do it as follows. First, I would rearrange the Weekday
column so that you have one row for each weekday a flight is going:
library(data.table)
library(dplyr)
library(tidyr)
tt <- tt %>% separate(weekday, into = as.character(1:7), sep = 1:6) %>%
gather(key="key", value="weekday", -c(start.date, end.date, Outbound.airport, dep.time)) %>%
filter(weekday %in% 1:7) %>%
select(-key)
Then I would do a left join of miscon
and tt
on the airport and weekday.
tt <- data.table(tt)
miscon <- data.table(miscon)
setkey(miscon, Outbound.airport, weekday)
setkey(tt, Outbound.airport, weekday)
df <- tt[miscon]
Check if flight date is on a valid date:
df = df[flight.date>=start.date & flight.date<=end.date]
Now you have a data.frame of all possible connections. The only thing left is to find the minimum time between the flights for each connection.
df[,timediff:= dep.time-arr.time, by=.(weekday, Outbound.airport)]
Now you can filter the rows by the minimum time delay (timediff
):
df = df[ , .SD[which.min(timediff)], by=.(weekday, Outbound.airport, flight.date, arr.time, next.pos.dep)]
setnames(df, "dep.time", "new.flight.time")
> df
weekday Outbound.airport flight.date arr.time next.pos.dep start.date end.date new.flight.time timediff
1: 7 KLU 2019-07-21 2020-04-27 07:33:00 2020-04-27 07:58:00 2019-03-25 2019-10-21 2020-04-27 12:20:00 17220 secs
2: 4 MXP 2019-08-05 2020-04-27 19:25:00 2020-04-27 19:36:00 2019-07-30 2019-08-26 2020-04-27 19:55:00 1800 secs
3: 3 OTP 2019-05-29 2020-04-27 18:49:00 2020-04-27 19:14:00 2019-05-29 2019-06-01 2020-04-27 20:34:00 6300 secs
The solution is a bit of a mix of dplyr
and data.table
.
Upvotes: 0
Reputation: 4169
Ok, it's not pretty but you have a fairly complex issue, and it's not fully clear to me if this gives you what you are looking for - you will need to check it on a larger dataset than the small example you provide to be sure first.
# setup
library(data.table)
setDT(tt)
setDT(miscon)
# make tt long format splitting weekdays out
tt <- melt(tt[, paste("V", 1:7, sep = "") := tstrsplit(weekday, "")][, -"weekday"], measure.vars = paste("V", 1:7, sep = ""))[value != "."][, c("weekday", "value", "variable") := .(value, NULL, NULL)]
# join, calculate time difference, convert format of times, rank on new.dep.time within group, and filter
newDT <- miscon[tt, on = c("Outbound.airport", "weekday"), nomatch = 0][
, new.dep.time := as.numeric(dep.time - arr.time)][
, c("arr.time", "dep.time", "next.pos.dep") := .(format(arr.time, "%H:%M"), format(dep.time, "%H:%M"), format(next.pos.dep, "%H:%M"))][
, new.dep.rank := rank(new.dep.time), by = c("Outbound.airport", "weekday")][
new.dep.rank == 1, -c("new.dep.rank", "new.dep.time")]
Upvotes: 0