Michael Byars
Michael Byars

Reputation: 117

R recursive join not getting desired results

I am working on recreating my Excel model in R due to the limited amount of data that Excel can process. The meat of the model takes two columns of flight data (inbound leg, outbound leg) and builds out lines of flight by matching the outbound leg to the inbound leg, placing the new outbound leg on the original line and repeating this process until there is no more inbound match for the outbound leg. Here is the working VBA code for this process.

    For i = f To l
        If i Mod 100 = 0 Then Application.StatusBar = "Progress: Step 4 of 18 - Building lines for " & ref.Cells(a, 39) & " (" & (a - 3) & " A/C types of " & (g - 3) & "), Line " & i - f & " of " & l - f & ")"
        DoEvents

    y = 0
    b = 0

        x = .Cells(i, 2)

        y = Application.Match(.Cells(i, 2), LegTable, 0)
        j = FirstTurn(y, 1)
        If .Cells(i, 2) <> FirstTurn(y, 1) Then GoTo Nexti

        NextLeg = NextLeg + 1
        ReDim Preserve NextTurn(0, 1 To NextLeg)
        NextTurn(0, NextLeg) = FirstTurn(y, 2)

            Do
                FTtext = FirstTurn(y, 2)
                On Error GoTo errhdlr
                b = Application.Match(FTtext, LegTable, 1)
                If FTtext <> FirstTurn(b, 1) Then GoTo Nexti

                NextLeg = NextLeg + 1
                ReDim Preserve NextTurn(0, 1 To NextLeg)
                NextTurn(0, NextLeg) = FirstTurn(b, 2)
                y = b
            Loop

errhdlr:
    Resume Nexti
Nexti:

    If NextLeg > 0 Then Range(.Cells(i, 3), .Cells(i, NextLeg + 2)).Value = NextTurn
    Erase NextTurn
    NextLeg = 0

    Next i

Sample data would be

In  Out
1   4
2   3
4   5
5   2

Output would be

1 4 5 2 3
2 3
4 5 2 3
5 2 3

In R, I have the following code

## Build Lines of Flight
  b.list <- list(a = data.frame(leg1, leg2), b = data.frame((leg2)))
  c.data <- join_all(b.list, by = leg2, type = "full", match = "all")

All this gives me is the original two columns. Thanks for the assistance.

Upvotes: 1

Views: 542

Answers (4)

Michael Byars
Michael Byars

Reputation: 117

Thank you Mark, Roland, and Snoram, for the suggestions. While trying to get those to work, I created a solution that works with my data. Not sure how efficient it truly is, but it ran through 128K rows (ended up being 248 total columns wide) in less than 6 seconds so I can't complain (my excel model would take over 5 minutes for the same dataset). Thanks again for the assistance. Here is my code:

## Build Lines of Flight
nr <- nrow(b.data)
c <- 2
c.df <- b.df
nlegname <- paste("leg", c, sep = "")
y <- match(leg2, leg1)

while(all(is.na(y)) == FALSE)
  {
y <- match(c.df[[c]], leg1)
d <- all(is.na(y))
nl <- b.df[y,"leg2"]
c.df <- add_column(c.df, nleg = nl)
c <-c+1
nlegname <- paste("leg", c, sep = "")
names(c.df)[names(c.df) == "nleg"] <- nlegname
}
c.df[[c]] <- NULL

Upvotes: 1

Mark
Mark

Reputation: 4537

So, if you're working with really large data, the goal should be to minimize work. In the example you gave, there's really only one full path, everything else is just a portion of that path (starting from 1). I assume that your data doesn't contain any loops (4 -> 3 -> 2 -> 4) because that would break this.

First, lets find all the unique starting points - these are the in values that are never in the out. There should be at least one of these if the non-looping condition I assume is true. We can also pull out all the other in locations that do have an out

b = data.frame(In = c(1,2,4,5), Out = c(4,3,5,2))

onlyStarting = b$In[!(b$In %in% b$Out)]
allOthers = b$In[b$In %in% b$Out]

Now we want a function that can get the paths for the starting points. I wrote a recursive function that does this. It finds the next step and calls itself until there are no more steps.

getNextStep = function(IN){
  nextStep = b$Out[b$In == IN]
  if(length(nextStep) == 0) return(IN)
  return(c(IN,getNextStep(nextStep)))
}
possiblePaths = lapply(onlyStarting,getNextStep)
#> [[1]]
#> [1] 1 4 5 2 3

We got the full path. Now we just need to find all the sub paths. We do this by checking each full path for the presence of the in and then returning the portion of the subpath we need. This avoid lots of expensive recomputing that we don't need to bother with.

findMatch = function(IN,possiblePaths){
  fullPath = possiblePaths[[which(sapply(possiblePaths,`%in%`,x=IN))[1]]]
  partialPath = fullPath[which(fullPath == IN):length(fullPath)]
  return(partialPath)
}
otherPaths = lapply(allOthers,findMatch,possiblePaths)
otherPaths
#> [[1]]
#> [1] 2 3
#> 
#> [[2]]
#> [1] 4 5 2 3
#> 
#> [[3]]
#> [1] 5 2 3

Upvotes: 1

s_baldur
s_baldur

Reputation: 33488

I liked the challenge of your question, so here is a not-so-elegant solution using base R. You mentioned you are working with big datasets and this is going to rank amongst the slower solutions but I will share it anyway at least until other solutions arrive:

lines_list <- split(df, df$In)
for (i in 1:length(lines_list)) {
  while (TRUE) {
    n <- length(lines_list[[i]])
    row <- which(lines_list[[i]][[n]] == df$In)
    if (any(row)) {
      lines_list[[i]][paste0("Out", n)] <- df$Out[row]
    } else {
      break
    }
  }
}
lines_list
$`1`
  In Out Out2 Out3 Out4
1  1   4    5    2    3

$`2`
  In Out
2  2   3

$`4`
  In Out Out2 Out3
3  4   5    2    3

$`5`
  In Out Out2
4  5   2    3

Or you could take it back to a data.frame with something like:

data.table::rbindlist(lines_list, fill = TRUE)  
   In Out Out2 Out3 Out4
1:  1   4    5    2    3
2:  2   3   NA   NA   NA
3:  4   5    2    3   NA
4:  5   2    3   NA   NA

Upvotes: 2

RolandASc
RolandASc

Reputation: 3923

without talking about efficiency, yes, you could do a recursive join starting like:

DF <- data.frame(In = c(1,2,4,5), Out = c(4,3,5,2))

dplyr::left_join(DF, DF, by = c("Out" = "In"))

#   In Out Out.y
# 1  1   4     5
# 2  2   3    NA
# 3  4   5     2
# 4  5   2     3

and so on... and later possibly re-shape into a list if you don't like the NAs

Upvotes: 1

Related Questions