Reputation: 20463
I have data that describes parent-child relationships:
df <- tibble::tribble(
~Child, ~Parent,
"Fruit", "Food",
"Vegetable", "Food",
"Apple", "Fruit",
"Banana", "Fruit",
"Pear", "Fruit",
"Carrot", "Vegetable",
"Celery", "Vegetable",
"Bike", "Not Food",
"Car", "Not Food"
)
df
#> # A tibble: 9 x 2
#> Child Parent
#> <chr> <chr>
#> 1 Fruit Food
#> 2 Vegetable Food
#> 3 Apple Fruit
#> 4 Banana Fruit
#> 5 Pear Fruit
#> 6 Carrot Vegetable
#> 7 Celery Vegetable
#> 8 Bike Not Food
#> 9 Car Not Food
Visually, this looks like:
Ultimately, my desired results are to "flatten" this to a structure that looks more like this:
results <- tibble::tribble(
~Level.03, ~Level.02, ~Level.01,
"Apple", "Fruit", "Food",
"Banana", "Fruit", "Food",
"Pear", "Fruit", "Food",
NA, "Bike", "Not Food",
NA, "Car", "Not Food"
)
results
#> # A tibble: 5 x 3
#> Level.03 Level.02 Level.01
#> <chr> <chr> <chr>
#> 1 Apple Fruit Food
#> 2 Banana Fruit Food
#> 3 Pear Fruit Food
#> 4 <NA> Bike Not Food
#> 5 <NA> Car Not Food
NOTE: Not all of the elements will have all of the levels. For example, Bike and Car do not have Level.03 elements.
I feel like there's a way to do this elegantly with tidyr or some type of nest/unnest function from jsonlite? I started with a recursive join, but I feel like I'm reinventing the wheel and there's likely a straightforward approach.
Upvotes: 2
Views: 1733
Reputation: 16842
I would think about it like a graph problem. There are 2 changes to make to the original data to fit this approach: switch the order of columns to show the hierarchical direction (parent to child), and add a top-level node (I'm calling it "Items") that links to the major groups (food & not food). You could probably do that second part programmatically but it seems like more of a pain than it's worth.
library(dplyr)
df <- tibble::tribble(
~Child, ~Parent,
"Fruit", "Food",
"Vegetable", "Food",
"Apple", "Fruit",
"Banana", "Fruit",
"Pear", "Fruit",
"Carrot", "Vegetable",
"Celery", "Vegetable",
"Bike", "Not Food",
"Car", "Not Food"
) %>%
select(Parent, Child) %>%
add_row(Parent = "Items", Child = c("Food", "Not Food"))
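For the second change, the top-level groups can also be found programmatically: they are the parents that never appear as a child. A minimal base R sketch, assuming every such root should hang off a single artificial node called "Items" (the names `edges`, `roots`, and `rooted` are illustrative, not from the original answer):

```r
# Original child/parent pairs from the question
edges <- data.frame(
  Child  = c("Fruit", "Vegetable", "Apple", "Banana", "Pear",
             "Carrot", "Celery", "Bike", "Car"),
  Parent = c("Food", "Food", "Fruit", "Fruit", "Fruit",
             "Vegetable", "Vegetable", "Not Food", "Not Food"),
  stringsAsFactors = FALSE
)

# Roots are parents that are never listed as a child
roots <- setdiff(edges$Parent, edges$Child)

# Reorder to parent -> child and attach every root to the artificial top node
rooted <- rbind(
  edges[, c("Parent", "Child")],
  data.frame(Parent = "Items", Child = roots, stringsAsFactors = FALSE)
)
```

The resulting `rooted` data frame has the same parent-to-child layout as the `df` built above, so it could be passed to either method below.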
The first method is with data.tree, which is designed to work with this type of data. It creates a tree representation, which you can then convert back to a data frame with one of a few shapes.
library(data.tree)
g1 <- FromDataFrameNetwork(df)
g1
#> levelName
#> 1 Items
#> 2 ¦--Food
#> 3 ¦ ¦--Fruit
#> 4 ¦ ¦ ¦--Apple
#> 5 ¦ ¦ ¦--Banana
#> 6 ¦ ¦ °--Pear
#> 7 ¦ °--Vegetable
#> 8 ¦ ¦--Carrot
#> 9 ¦ °--Celery
#> 10 °--Not Food
#> 11 ¦--Bike
#> 12 °--Car
ToDataFrameTypeCol(g1)
#> level_1 level_2 level_3 level_4
#> 1 Items Food Fruit Apple
#> 2 Items Food Fruit Banana
#> 3 Items Food Fruit Pear
#> 4 Items Food Vegetable Carrot
#> 5 Items Food Vegetable Celery
#> 6 Items Not Food Bike <NA>
#> 7 Items Not Food Car <NA>
The second method is more convoluted and probably only makes sense if there are other graph operations you need to do. Make a graph with igraph, then get all the paths in the graph starting from the top node Items. That gives you a list of vertex sequences; for each of those, extract the IDs. One example is shown below.
library(igraph)
g2 <- graph_from_data_frame(df)
all_simple_paths(g2, from = "Items") %>%
purrr::map(as_ids) %>%
`[[`(4)
#> [1] "Items" "Food" "Fruit" "Banana"
Create data frames from all those vectors, bind, and reshape to get one column per level.
all_simple_paths(g2, from = "Items") %>%
purrr::map(as_ids) %>%
purrr::map_dfr(tibble::enframe, .id = "row") %>%
tidyr::pivot_wider(id_cols = row, names_prefix = "level_")
#> # A tibble: 11 × 5
#> row level_1 level_2 level_3 level_4
#> <chr> <chr> <chr> <chr> <chr>
#> 1 1 Items Food <NA> <NA>
#> 2 2 Items Food Fruit <NA>
#> 3 3 Items Food Fruit Apple
#> 4 4 Items Food Fruit Banana
#> 5 5 Items Food Fruit Pear
#> 6 6 Items Food Vegetable <NA>
#> 7 7 Items Food Vegetable Carrot
#> 8 8 Items Food Vegetable Celery
#> 9 9 Items Not Food <NA> <NA>
#> 10 10 Items Not Food Bike <NA>
#> 11 11 Items Not Food Car <NA>
In either case, drop the level 1 column if you don't actually want it.
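The igraph result also contains paths that stop at intermediate nodes (for example Items, Food). If only complete root-to-leaf rows are wanted, one option is to restrict the search to leaf vertices, meaning those with no outgoing edges. This is a sketch under that assumption, with the edge list rebuilt inline so it runs on its own; `edges`, `leaf_names`, `leaf_paths`, and `wide` are illustrative names:

```r
library(igraph)

# Parent -> child edges, including the artificial "Items" root
edges <- data.frame(
  Parent = c("Items", "Items", "Food", "Food", "Fruit", "Fruit", "Fruit",
             "Vegetable", "Vegetable", "Not Food", "Not Food"),
  Child  = c("Food", "Not Food", "Fruit", "Vegetable", "Apple", "Banana",
             "Pear", "Carrot", "Celery", "Bike", "Car"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges)

# Leaves are vertices without outgoing edges
leaf_names <- names(which(degree(g, mode = "out") == 0))

# Only keep paths from the root that end in a leaf
leaf_paths <- lapply(
  all_simple_paths(g, from = "Items", to = leaf_names),
  as_ids
)

# Pad shorter paths with NA and bind into one column per level
depth <- max(lengths(leaf_paths))
wide <- as.data.frame(
  do.call(rbind, lapply(leaf_paths, `length<-`, depth)),
  stringsAsFactors = FALSE
)
names(wide) <- paste0("level_", seq_len(depth))
```

Dropping `wide$level_1` afterwards removes the artificial root, as with the other outputs.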
Upvotes: 3
Reputation: 79238
Here is a function that uses a while loop:
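fun <- function(s){
  i <- 1
  # walk the list of parents; s shrinks as child groups are merged in
  while(i<=length(s)){
    if(any(s[[i]] %in% names(s)))
    {
      # these children are parents themselves: stack their own children
      # into a (values, ind) data frame and drop them from the top level
      nms <- s[[i]]
      s[[i]] <- stack(s[nms])
      s[nms] <- NULL
    }
    else
      # children with nothing beneath them: the deepest level is NA
      s[[i]] <- data.frame(values = NA, ind = s[[i]])
    i <- i+1
  }
  s
}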
dplyr::bind_rows(fun(unstack(df)), .id = 'Level.01')[c(2:3,1)]
values ind Level.01
1 Apple Fruit Food
2 Banana Fruit Food
3 Pear Fruit Food
4 Carrot Vegetable Food
5 Celery Vegetable Food
6 <NA> Bike Not Food
7 <NA> Car Not Food
You could generalize this if you had more levels.
Upvotes: 4
Reputation: 16988
In this special case you could get your desired result by doing some joining and binding:
library(dplyr)
df2 <- df %>%
inner_join(df,
by = c("Parent" = "Child"),
suffix = c("", "_top"))
df %>%
anti_join(df2) %>%
select(Parent_top = Parent, Parent = Child) %>%
bind_rows(df2) %>%
group_by(Parent_top, Parent) %>%
filter(!is.na(Child) | n() == 1) %>%
select(Level_01 = Parent_top, Level_02 = Parent, Level_03 = Child)
However, I don't think this approach is robust for larger or differently shaped datasets, since the single self-join only handles three levels. Looping over the dataset is probably a better general solution.
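As an illustration of the loop idea, here is a minimal base R sketch that walks from each leaf up to its root and pads shorter paths with NA. It assumes each child has exactly one parent and the data contains no cycles; `flatten_hierarchy` and `edges` are hypothetical names, not part of any package:

```r
flatten_hierarchy <- function(edges) {
  # Lookup table: child name -> parent name
  parent_of <- setNames(edges$Parent, edges$Child)

  # Leaves are children that never act as a parent
  leaves <- setdiff(edges$Child, edges$Parent)

  # Climb from each leaf to its root, building the path root-first
  paths <- lapply(leaves, function(node) {
    path <- node
    while (!is.na(parent_of[node])) {
      node <- parent_of[[node]]
      path <- c(node, path)
    }
    path
  })

  # Pad every path to the maximum depth so they fit in one table
  depth <- max(lengths(paths))
  padded <- vapply(
    paths,
    function(p) c(p, rep(NA_character_, depth - length(p))),
    character(depth)
  )
  out <- as.data.frame(t(padded), stringsAsFactors = FALSE)
  names(out) <- sprintf("Level.%02d", seq_len(depth))
  out
}

edges <- data.frame(
  Child  = c("Fruit", "Vegetable", "Apple", "Banana", "Pear",
             "Carrot", "Celery", "Bike", "Car"),
  Parent = c("Food", "Food", "Fruit", "Fruit", "Fruit",
             "Vegetable", "Vegetable", "Not Food", "Not Food"),
  stringsAsFactors = FALSE
)
res <- flatten_hierarchy(edges)
```

Here `Level.01` holds the root of each path, and branches that are shallower than the deepest one get NA in the trailing columns, which matches the layout requested in the question. Because the depth is computed from the data, the same function should cope with more than three levels.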
Upvotes: 2