Hansen2
Hansen2

Reputation: 13

Match values with multiple conditions using two data.frames

I'm fairly new in R and need some help. I have two dataframes with rather similar information. The first dataframe has information about misconnections for an airline, whereas the other one is the entire timetable for the same airline. Now, what I need is to make a new column in the misconnection data.frame including flights from the timetable that can replace the delayed flights on the transit.

The flights that I want to replace need to meet a range of conditions (within a certain time-horizon, needs to be the same weekday and it needs to fly to the same destination). I addition, I want R to choose the flight that is closest (by time) to the new arrival time at a transit(from the misconnection data.frame).

The misconnection data.frame looks like the following (1620 lines in total):

miscon <- data.frame(flight.date = as.Date(c("2019-08-05", "2019-10-03", "2019-07-21", "2019-05-29"), format="%Y-%m-%d"),
                     Outbound.airport = c("MXP", "KRK", "KLU", "OTP"),  
                     arr.time = as.POSIXct(c("19:25:00", "20:52:00", "07:33:00", "18:49:00"), format="%H:%M:%S"),    
                     next.pos.dep = as.POSIXct(c("19:36:00", "21:17:00", "07:58:00", "19:14:00"), format="%H:%M:%S"),
                     weekday = c("4", "7", "7", "3"))

view(miscon)

        flight.date    Outbound.airport    arr.time    next.pos.dep    Weekday
1       2019-08-05     MXP                 19:25:00    19:36:00        4
2       2019-10-03     KRK                 20:52:00    21:17:00        7
3       2019-07-21     KLU                 07:33:00    07:58:00        7
4       2019-05-29     OTP                 18:49:00    19:14:00        3

And the timetable data.frame would look like this:

tt <- data.frame(start.date = as.Date(c("2019-03-25", "2019-05-02", "2019-07-30", "2019-05-29"), format="%Y-%m-%d"),
                 end.date = as.Date(c("2019-10-21", "2019-10-27", "2019-08-26", "2019-06-01"), format="%Y-%m-%d"),
                 weekday = c("1234567", "1.3..67", "1.34567", "..3.5.."),
                 Outbound.airport = c("KLU", "KLU", "MXP", "OTP"),  
                 dep.time = as.POSIXct(c("12:20:00", "15:55:00", "19:55:00", "20:34:00"), format="%H:%M:%S"))    

view(tt)

    start.date    end.date     Weekday     Outbound.airport    dep.time
1   2019-03-25    2019-10-21   1234567     KLU                 12:20:00   
2   2019-05-02    2019-10-27   1.3..67     KLU                 15:55:00
3   2019-07-30    2019-08-26   1.34567     MXP                 19:55:00
4   2019-03-30    2019-06-01   ..3.5..     OTP                 20:34:00

In Excel, this problem is solved using Index matching, which I've managed. However, the problem is slightly to big for excel to handle which is why I need to convert this to R. Did try with the match and mutate function in R, but seems like the values I'm matching must be equal - which I do not expect mine to be.

Also found an interesting solution to a similar problem using the DescTools package, which I tried to implemt with no success.

get_close2 <- function(xx=tt, yy=miscon) {
  pos <- vector(mode = "numeric")
  for(i in 1:dim(yy)[1]) {
    pos[i] <- DescTools::Closest(xx$dep.time, yy$next.pos.dep[i])
    #print(pos[i])
    yy$new.flight[i] <- pos[i]
  }
  out <- yy
  return(out)
}

get_close2()

For this one, I tried with only one condition. It generated a column, but with NA's only. Obviously, I am far out right now, which is why I'm reaching out for help. Hope the problem was clear. The end result would preferrably look something like the following:

miscon
        flight.date    Outbound.airport    arr.time    next.pos.dep    Weekday   new.flight.time
1       2019-12-05     MXP                 19:25:00    19:36:00        4         19:55:00
2       2019-10-03     KRK                 20:52:00    21:17:00        7         NA
3       2019-07-21     KLU                 07:33:00    07:58:00        7         12:20:00
4       2019-05-29     OTP                 18:49:00    19:14:00        3         20:34:00

Upvotes: 1

Views: 138

Answers (2)

otwtm
otwtm

Reputation: 1999

I think you can do it as follows. First, I would rearrange the Weekday column so that you have one row for each weekday a flight is going:

library(data.table)
library(dplyr)
library(tidyr)

tt <- tt %>% separate(weekday, into = as.character(1:7), sep = 1:6) %>% 
  gather(key="key", value="weekday", -c(start.date, end.date, Outbound.airport, dep.time)) %>%
  filter(weekday %in% 1:7) %>%
  select(-key)

Then I would do a left join of miscon and tt on the airport and weekday.

tt <- data.table(tt)
miscon <- data.table(miscon)
setkey(miscon, Outbound.airport, weekday)
setkey(tt, Outbound.airport, weekday)
df <- tt[miscon]

Check if flight date is on a valid date:

df = df[flight.date>=start.date & flight.date<=end.date]

Now you have a data.frame of all possible connections. The only thing left is to find the minimum time between the flights for each connection.

df[,timediff:= dep.time-arr.time, by=.(weekday, Outbound.airport)]

Now you can filter the rows by the minimum time delay (timediff):

df = df[ , .SD[which.min(timediff)],  by=.(weekday, Outbound.airport, flight.date, arr.time, next.pos.dep)]
setnames(df, "dep.time", "new.flight.time")

> df
   weekday Outbound.airport flight.date            arr.time        next.pos.dep start.date   end.date     new.flight.time   timediff
1:       7              KLU  2019-07-21 2020-04-27 07:33:00 2020-04-27 07:58:00 2019-03-25 2019-10-21 2020-04-27 12:20:00 17220 secs
2:       4              MXP  2019-08-05 2020-04-27 19:25:00 2020-04-27 19:36:00 2019-07-30 2019-08-26 2020-04-27 19:55:00  1800 secs
3:       3              OTP  2019-05-29 2020-04-27 18:49:00 2020-04-27 19:14:00 2019-05-29 2019-06-01 2020-04-27 20:34:00  6300 secs

The solution is a bit of a mix of dplyr and data.table.

Upvotes: 0

rg255
rg255

Reputation: 4169

Ok, it's not pretty but you have a fairly complex issue, and it's not fully clear to me if this gives you what you are looking for - you will need to check it on a larger dataset than the small example you provide to be sure first.

# setup
library(data.table)
setDT(tt)
setDT(miscon)

# make tt long format splitting weekdays out
tt <- melt(tt[, paste("V", 1:7, sep = "") := tstrsplit(weekday, "")][, -"weekday"], measure.vars = paste("V", 1:7, sep = ""))[value != "."][, c("weekday", "value", "variable") := .(value, NULL, NULL)]

# join, calculate time difference, convert format of times, rank on new.dep.time within group, and filter
newDT <- miscon[tt, on = c("Outbound.airport", "weekday"), nomatch = 0][
  , new.dep.time := as.numeric(dep.time - arr.time)][
  , c("arr.time", "dep.time", "next.pos.dep") := .(format(arr.time, "%H:%M"), format(dep.time, "%H:%M"), format(next.pos.dep, "%H:%M"))][
  , new.dep.rank := rank(new.dep.time), by = c("Outbound.airport", "weekday")][
  new.dep.rank == 1, -c("new.dep.rank", "new.dep.time")]

Upvotes: 0

Related Questions