deschen
deschen

Reputation: 10996

Backconvert tree to data frame structure by keeping original rows

Let's assume the following data:

x <- structure(list(parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.", 
"Accounting", "Accounting", "Research", "Research", "IT", "IT", 
"IT"), child = c("Accounting", "Research", "IT", "New Software", 
"New Accounting Standards", "New Product Line", "New Labs", "Outsource", 
"Go agile", "Switch to R"), misc = c("a", "b", "c", "d", "e", 
"f", "g", "h", "i", "j")), row.names = c(NA, 10L), class = "data.frame")

       parent                    child misc
1   Acme Inc.               Accounting    a
2   Acme Inc.                 Research    b
3   Acme Inc.                       IT    c
4  Accounting             New Software    d
5  Accounting New Accounting Standards    e
6    Research         New Product Line    f
7    Research                 New Labs    g
8          IT                Outsource    h
9          IT                 Go agile    i
10         IT              Switch to R    j

I can now convert that to a tree structure with the data.tree package.

my_tree <- data.tree::FromDataFrameNetwork(x)

What I actually want to get is the level information, more or less in wide format, which I can theoretically get via

my_data <- data.tree::ToDataFrameTypeCol(my_tree)

which gives:

    level_1    level_2                  level_3
1 Acme Inc. Accounting             New Software
2 Acme Inc. Accounting New Accounting Standards
3 Acme Inc.   Research         New Product Line
4 Acme Inc.   Research                 New Labs
5 Acme Inc.         IT                Outsource
6 Acme Inc.         IT                 Go agile
7 Acme Inc.         IT              Switch to R

However, as you can see, this output has fewer rows than the original data (7 instead of 10). This is because the function only gives me the final leaves, if I see that correctly. But what I want is for each row in the original data frame, enhance the data with the full level information for that particular child. E.g. we know that "Accounting" is level 2, so I want to put that information as new columns to the original data.

The expected outcome would be:

       parent                    child misc   level_1    level_2                  level_3
1   Acme Inc.               Accounting    a Acme Inc. Accounting                       NA
2   Acme Inc.                 Research    b Acme Inc.   Research                       NA
3   Acme Inc.                       IT    c Acme Inc.         IT                       NA
4  Accounting             New Software    d Acme Inc. Accounting             New Software
5  Accounting New Accounting Standards    e Acme Inc. Accounting New Accounting Standards
6    Research         New Product Line    f Acme Inc.   Research         New Product Line
7    Research                 New Labs    g Acme Inc.   Research                 New Labs
8          IT                Outsource    h Acme Inc.         IT                Outsource
9          IT                 Go agile    i Acme Inc.         IT                 Go agile
10         IT              Switch to R    j Acme Inc.         IT              Switch to R

I'm stuck here how to achieve this. Any idea?

Upvotes: 0

Views: 31

Answers (2)

deschen
deschen

Reputation: 10996

I decided for a solution completely circumventing a tree structure (which also boosts performance by a factor of 10000 or so in my real-life use case).

What I'm doing is essentially creating a loop where in each loop I match the parent to its child and then move one level up until there's no more parent left.

I then do some tidy cleanup of the data, so that it hast the structure and order I want.

using the example data from my initial question, I do:

library(tidyverse)    
still_open <- nrow(x)
i <- 2
x2 <- x |> 
  mutate(level_1 = child)

while (still_open != 0)
{
  x2 <- x2 |> 
    mutate("level_{i}" := parent[match(.data[[paste0("level_", i - 1)]], child)], .after = .data[[paste0("level_", i - 1)]])
  
  still_open <- x2 |> 
    pull(paste0("level_", i)) |> 
    na.omit() |> 
    length()
  
  i <- i + 1
}

x2 <- x2 |> 
  pivot_longer(cols = starts_with("level_")) |> 
  filter(!is.na(value)) |> 
  mutate(value = rev(value), .by = child) |> 
  pivot_wider(names_from  = name,
              values_from = value)

which gives:

# A tibble: 10 × 6
   parent     child                    misc  level_1   level_2    level_3                 
   <chr>      <chr>                    <chr> <chr>     <chr>      <chr>                   
 1 Acme Inc.  Accounting               a     Acme Inc. Accounting NA                      
 2 Acme Inc.  Research                 b     Acme Inc. Research   NA                      
 3 Acme Inc.  IT                       c     Acme Inc. IT         NA                      
 4 Accounting New Software             d     Acme Inc. Accounting New Software            
 5 Accounting New Accounting Standards e     Acme Inc. Accounting New Accounting Standards
 6 Research   New Product Line         f     Acme Inc. Research   New Product Line        
 7 Research   New Labs                 g     Acme Inc. Research   New Labs                
 8 IT         Outsource                h     Acme Inc. IT         Outsource               
 9 IT         Go agile                 i     Acme Inc. IT         Go agile                
10 IT         Switch to R              j     Acme Inc. IT         Switch to R    

Upvotes: 0

DaveArmstrong
DaveArmstrong

Reputation: 21757

This is likely not the most elegant solution, but it does seem to work. What you want is to first join the two data together on the keys parent == level_2 and child == level_3. Next, you can join the result with the tree data with the keys parent == level_1 and child == level_2. This will join in the remaining observations that weren't matched in the first join. You can coalesce the different level_1 and level_2 variables to consolidate the information across the joins. Finally, using distinct() will eliminate the duplicates that were produced during the joins.

library(dplyr)
library(data.tree)
  
x <- structure(list(parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.", 
                               "Accounting", "Accounting", "Research", "Research", "IT", "IT", 
                               "IT"), child = c("Accounting", "Research", "IT", "New Software", 
                                                "New Accounting Standards", "New Product Line", "New Labs", "Outsource", 
                                                "Go agile", "Switch to R"), misc = c("a", "b", "c", "d", "e", 
                                                                                     "f", "g", "h", "i", "j")), row.names = c(NA, 10L), class = "data.frame")

my_tree <- data.tree::FromDataFrameNetwork(x)
my_data <- data.tree::ToDataFrameTypeCol(my_tree)


left_join(x, my_data, join_by(parent==level_2, child==level_3), keep=TRUE) %>% 
  left_join(my_data %>% select(level_1, level_2), join_by(parent==level_1, child==level_2), keep=TRUE) %>%
  mutate(level_1 = coalesce(level_1.x, level_1.y), 
         level_2 = coalesce(level_2.x, level_2.y)) %>% 
  select(parent:misc, level_1, level_2, level_3) %>% 
  distinct()
#>        parent                    child misc   level_1    level_2
#> 1   Acme Inc.               Accounting    a Acme Inc. Accounting
#> 2   Acme Inc.                 Research    b Acme Inc.   Research
#> 3   Acme Inc.                       IT    c Acme Inc.         IT
#> 4  Accounting             New Software    d Acme Inc. Accounting
#> 5  Accounting New Accounting Standards    e Acme Inc. Accounting
#> 6    Research         New Product Line    f Acme Inc.   Research
#> 7    Research                 New Labs    g Acme Inc.   Research
#> 8          IT                Outsource    h Acme Inc.         IT
#> 9          IT                 Go agile    i Acme Inc.         IT
#> 10         IT              Switch to R    j Acme Inc.         IT
#>                     level_3
#> 1                      <NA>
#> 2                      <NA>
#> 3                      <NA>
#> 4              New Software
#> 5  New Accounting Standards
#> 6          New Product Line
#> 7                  New Labs
#> 8                 Outsource
#> 9                  Go agile
#> 10              Switch to R

Created on 2023-06-01 with reprex v2.0.2

Upvotes: 1

Related Questions