Reputation: 10996
Let's assume the following data:
x <- structure(list(parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.",
"Accounting", "Accounting", "Research", "Research", "IT", "IT",
"IT"), child = c("Accounting", "Research", "IT", "New Software",
"New Accounting Standards", "New Product Line", "New Labs", "Outsource",
"Go agile", "Switch to R"), misc = c("a", "b", "c", "d", "e",
"f", "g", "h", "i", "j")), row.names = c(NA, 10L), class = "data.frame")
parent child misc
1 Acme Inc. Accounting a
2 Acme Inc. Research b
3 Acme Inc. IT c
4 Accounting New Software d
5 Accounting New Accounting Standards e
6 Research New Product Line f
7 Research New Labs g
8 IT Outsource h
9 IT Go agile i
10 IT Switch to R j
I can now convert that to a tree structure with the data.tree
package.
my_tree <- data.tree::FromDataFrameNetwork(x)
What I actually want to get is the level information, more or less in wide format, which I can theoretically get via
my_data <- data.tree::ToDataFrameTypeCol(my_tree)
which gives:
level_1 level_2 level_3
1 Acme Inc. Accounting New Software
2 Acme Inc. Accounting New Accounting Standards
3 Acme Inc. Research New Product Line
4 Acme Inc. Research New Labs
5 Acme Inc. IT Outsource
6 Acme Inc. IT Go agile
7 Acme Inc. IT Switch to R
However, as you can see, this output has fewer rows than the original data (7 instead of 10). This is because the function only gives me the final leaves, if I see that correctly. But what I want is for each row in the original data frame, enhance the data with the full level information for that particular child. E.g. we know that "Accounting" is level 2, so I want to put that information as new columns to the original data.
The expected outcome would be:
parent child misc level_1 level_2 level_3
1 Acme Inc. Accounting a Acme Inc. Accounting NA
2 Acme Inc. Research b Acme Inc. Research NA
3 Acme Inc. IT c Acme Inc. IT NA
4 Accounting New Software d Acme Inc. Accounting New Software
5 Accounting New Accounting Standards e Acme Inc. Accounting New Accounting Standards
6 Research New Product Line f Acme Inc. Research New Product Line
7 Research New Labs g Acme Inc. Research New Labs
8 IT Outsource h Acme Inc. IT Outsource
9 IT Go agile i Acme Inc. IT Go agile
10 IT Switch to R j Acme Inc. IT Switch to R
I'm stuck here how to achieve this. Any idea?
Upvotes: 0
Views: 31
Reputation: 10996
I decided for a solution completely circumventing a tree structure (which also boosts performance by a factor of 10000 or so in my real-life use case).
What I'm doing is essentially creating a loop where in each loop I match the parent to its child and then move one level up until there's no more parent left.
I then do some tidy cleanup of the data, so that it hast the structure and order I want.
using the example data from my initial question, I do:
library(tidyverse)
still_open <- nrow(x)
i <- 2
x2 <- x |>
mutate(level_1 = child)
while (still_open != 0)
{
x2 <- x2 |>
mutate("level_{i}" := parent[match(.data[[paste0("level_", i - 1)]], child)], .after = .data[[paste0("level_", i - 1)]])
still_open <- x2 |>
pull(paste0("level_", i)) |>
na.omit() |>
length()
i <- i + 1
}
x2 <- x2 |>
pivot_longer(cols = starts_with("level_")) |>
filter(!is.na(value)) |>
mutate(value = rev(value), .by = child) |>
pivot_wider(names_from = name,
values_from = value)
which gives:
# A tibble: 10 × 6
parent child misc level_1 level_2 level_3
<chr> <chr> <chr> <chr> <chr> <chr>
1 Acme Inc. Accounting a Acme Inc. Accounting NA
2 Acme Inc. Research b Acme Inc. Research NA
3 Acme Inc. IT c Acme Inc. IT NA
4 Accounting New Software d Acme Inc. Accounting New Software
5 Accounting New Accounting Standards e Acme Inc. Accounting New Accounting Standards
6 Research New Product Line f Acme Inc. Research New Product Line
7 Research New Labs g Acme Inc. Research New Labs
8 IT Outsource h Acme Inc. IT Outsource
9 IT Go agile i Acme Inc. IT Go agile
10 IT Switch to R j Acme Inc. IT Switch to R
Upvotes: 0
Reputation: 21757
This is likely not the most elegant solution, but it does seem to work. What you want is to first join the two data together on the keys parent == level_2
and child == level_3
. Next, you can join the result with the tree data with the keys parent == level_1
and child == level_2
. This will join in the remaining observations that weren't matched in the first join. You can coalesce the different level_1
and level_2
variables to consolidate the information across the joins. Finally, using distinct()
will eliminate the duplicates that were produced during the joins.
library(dplyr)
library(data.tree)
x <- structure(list(parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.",
"Accounting", "Accounting", "Research", "Research", "IT", "IT",
"IT"), child = c("Accounting", "Research", "IT", "New Software",
"New Accounting Standards", "New Product Line", "New Labs", "Outsource",
"Go agile", "Switch to R"), misc = c("a", "b", "c", "d", "e",
"f", "g", "h", "i", "j")), row.names = c(NA, 10L), class = "data.frame")
my_tree <- data.tree::FromDataFrameNetwork(x)
my_data <- data.tree::ToDataFrameTypeCol(my_tree)
left_join(x, my_data, join_by(parent==level_2, child==level_3), keep=TRUE) %>%
left_join(my_data %>% select(level_1, level_2), join_by(parent==level_1, child==level_2), keep=TRUE) %>%
mutate(level_1 = coalesce(level_1.x, level_1.y),
level_2 = coalesce(level_2.x, level_2.y)) %>%
select(parent:misc, level_1, level_2, level_3) %>%
distinct()
#> parent child misc level_1 level_2
#> 1 Acme Inc. Accounting a Acme Inc. Accounting
#> 2 Acme Inc. Research b Acme Inc. Research
#> 3 Acme Inc. IT c Acme Inc. IT
#> 4 Accounting New Software d Acme Inc. Accounting
#> 5 Accounting New Accounting Standards e Acme Inc. Accounting
#> 6 Research New Product Line f Acme Inc. Research
#> 7 Research New Labs g Acme Inc. Research
#> 8 IT Outsource h Acme Inc. IT
#> 9 IT Go agile i Acme Inc. IT
#> 10 IT Switch to R j Acme Inc. IT
#> level_3
#> 1 <NA>
#> 2 <NA>
#> 3 <NA>
#> 4 New Software
#> 5 New Accounting Standards
#> 6 New Product Line
#> 7 New Labs
#> 8 Outsource
#> 9 Go agile
#> 10 Switch to R
Created on 2023-06-01 with reprex v2.0.2
Upvotes: 1