Reputation: 117
I am working on recreating my Excel model in R due to the limited amount of data that Excel can process. The meat of the model takes two columns of flight data (inbound leg, outbound leg) and builds out lines of flight by matching the outbound leg to the inbound leg, placing the new outbound leg on the original line and repeating this process until there is no more inbound match for the outbound leg. Here is the working VBA code for this process.
For i = f To l
If i Mod 100 = 0 Then Application.StatusBar = "Progress: Step 4 of 18 - Building lines for " & ref.Cells(a, 39) & " (" & (a - 3) & " A/C types of " & (g - 3) & "), Line " & i - f & " of " & l - f & ")"
DoEvents
y = 0
b = 0
x = .Cells(i, 2)
y = Application.Match(.Cells(i, 2), LegTable, 0)
j = FirstTurn(y, 1)
If .Cells(i, 2) <> FirstTurn(y, 1) Then GoTo Nexti
NextLeg = NextLeg + 1
ReDim Preserve NextTurn(0, 1 To NextLeg)
NextTurn(0, NextLeg) = FirstTurn(y, 2)
Do
FTtext = FirstTurn(y, 2)
On Error GoTo errhdlr
b = Application.Match(FTtext, LegTable, 1)
If FTtext <> FirstTurn(b, 1) Then GoTo Nexti
NextLeg = NextLeg + 1
ReDim Preserve NextTurn(0, 1 To NextLeg)
NextTurn(0, NextLeg) = FirstTurn(b, 2)
y = b
Loop
errhdlr:
Resume Nexti
Nexti:
If NextLeg > 0 Then Range(.Cells(i, 3), .Cells(i, NextLeg + 2)).Value = NextTurn
Erase NextTurn
NextLeg = 0
Next i
Sample data would be
In Out
1 4
2 3
4 5
5 2
Output would be
1 4 5 2 3
2 3
4 5 2 3
5 2 3
In R, I have the following code
## Build Lines of Flight
b.list <- list(a = data.frame(leg1, leg2), b = data.frame((leg2)))
c.data <- join_all(b.list, by = leg2, type = "full", match = "all")
All this gives me is the original two columns. Thanks for the assistance.
Upvotes: 1
Views: 542
Reputation: 117
Thank you Mark, Roland, and Snoram, for the suggestions. While trying to get those to work, I created a solution that works with my data. Not sure how efficient it truly is, but it ran through 128K rows (ended up being 248 total columns wide) in less than 6 seconds so I can't complain (my excel model would take over 5 minutes for the same dataset). Thanks again for the assistance. Here is my code:
## Build Lines of Flight
nr <- nrow(b.data)
c <- 2
c.df <- b.df
nlegname <- paste("leg", c, sep = "")
y <- match(leg2, leg1)
while(all(is.na(y)) == FALSE)
{
y <- match(c.df[[c]], leg1)
d <- all(is.na(y))
nl <- b.df[y,"leg2"]
c.df <- add_column(c.df, nleg = nl)
c <-c+1
nlegname <- paste("leg", c, sep = "")
names(c.df)[names(c.df) == "nleg"] <- nlegname
}
c.df[[c]] <- NULL
Upvotes: 1
Reputation: 4537
So, if you're working with really large data, the goal should be to minimize work. In the example you gave, there's really only one full path, everything else is just a portion of that path (starting from 1). I assume that your data doesn't contain any loops (4 -> 3 -> 2 -> 4) because that would break this.
First, lets find all the unique starting points - these are the in
values that are never in the out
. There should be at least one of these if the non-looping condition I assume is true. We can also pull out all the other in
locations that do have an out
b = data.frame(In = c(1,2,4,5), Out = c(4,3,5,2))
onlyStarting = b$In[!(b$In %in% b$Out)]
allOthers = b$In[b$In %in% b$Out]
Now we want a function that can get the paths for the starting points. I wrote a recursive function that does this. It finds the next step and calls itself until there are no more steps.
getNextStep = function(IN){
nextStep = b$Out[b$In == IN]
if(length(nextStep) == 0) return(IN)
return(c(IN,getNextStep(nextStep)))
}
possiblePaths = lapply(onlyStarting,getNextStep)
#> [[1]]
#> [1] 1 4 5 2 3
We got the full path. Now we just need to find all the sub paths. We do this by checking each full path for the presence of the in
and then returning the portion of the subpath we need. This avoid lots of expensive recomputing that we don't need to bother with.
findMatch = function(IN,possiblePaths){
fullPath = possiblePaths[[which(sapply(possiblePaths,`%in%`,x=IN))[1]]]
partialPath = fullPath[which(fullPath == IN):length(fullPath)]
return(partialPath)
}
otherPaths = lapply(allOthers,findMatch,possiblePaths)
otherPaths
#> [[1]]
#> [1] 2 3
#>
#> [[2]]
#> [1] 4 5 2 3
#>
#> [[3]]
#> [1] 5 2 3
Upvotes: 1
Reputation: 33488
I liked the challenge of your question, so here is a not-so-elegant solution using base R
. You mentioned you are working with big datasets and this is going to rank amongst the slower solutions but I will share it anyway at least until other solutions arrive:
lines_list <- split(df, df$In)
for (i in 1:length(lines_list)) {
while (TRUE) {
n <- length(lines_list[[i]])
row <- which(lines_list[[i]][[n]] == df$In)
if (any(row)) {
lines_list[[i]][paste0("Out", n)] <- df$Out[row]
} else {
break
}
}
}
lines_list
$`1`
In Out Out2 Out3 Out4
1 1 4 5 2 3
$`2`
In Out
2 2 3
$`4`
In Out Out2 Out3
3 4 5 2 3
$`5`
In Out Out2
4 5 2 3
Or you could take it back to a data.frame
with something like:
data.table::rbindlist(lines_list, fill = TRUE)
In Out Out2 Out3 Out4
1: 1 4 5 2 3
2: 2 3 NA NA NA
3: 4 5 2 3 NA
4: 5 2 3 NA NA
Upvotes: 2
Reputation: 3923
without talking about efficiency, yes, you could do a recursive join starting like:
DF <- data.frame(In = c(1,2,4,5), Out = c(4,3,5,2))
dplyr::left_join(DF, DF, by = c("Out" = "In"))
# In Out Out.y
# 1 1 4 5
# 2 2 3 NA
# 3 4 5 2
# 4 5 2 3
and so on...
and later possibly re-shape into a list if you don't like the NAs
Upvotes: 1