JasonAizkalns

Reputation: 20463

How to flatten a hierarchical data structure given parent child relationships in R

I have data that describes parent-child relationships:

df <- tibble::tribble(
       ~Child,     ~Parent,
      "Fruit",      "Food",
  "Vegetable",      "Food",
      "Apple",     "Fruit",
     "Banana",     "Fruit",
       "Pear",     "Fruit",
     "Carrot", "Vegetable",
     "Celery", "Vegetable",
       "Bike",  "Not Food",
        "Car",  "Not Food"
  )
df
#> # A tibble: 9 x 2
#>   Child     Parent   
#>   <chr>     <chr>    
#> 1 Fruit     Food     
#> 2 Vegetable Food     
#> 3 Apple     Fruit    
#> 4 Banana    Fruit    
#> 5 Pear      Fruit    
#> 6 Carrot    Vegetable
#> 7 Celery    Vegetable
#> 8 Bike      Not Food 
#> 9 Car       Not Food

Visually, this looks like:

[Image: diagram of the hierarchy described by the data above]

Ultimately, I want to "flatten" this into a structure that looks more like this:

results <- tibble::tribble(
             ~Level.03, ~Level.02,  ~Level.01,
               "Apple",   "Fruit",     "Food",
              "Banana",   "Fruit",     "Food",
                "Pear",   "Fruit",     "Food",
                    NA,    "Bike", "Not Food",
                    NA,     "Car", "Not Food"
             )
results
#> # A tibble: 5 x 3
#>   Level.03 Level.02 Level.01
#>   <chr>    <chr>    <chr>   
#> 1 Apple    Fruit    Food    
#> 2 Banana   Fruit    Food    
#> 3 Pear     Fruit    Food    
#> 4 <NA>     Bike     Not Food
#> 5 <NA>     Car      Not Food

NOTE: Not all of the elements will have all of the levels. For example, bike and car do not have Level.03 elements.

I feel like there's a way to do this elegantly with tidyr or some type of nest/unnest function from jsonlite? I started with a recursive join, but I feel like I'm reinventing the wheel and there's likely a straightforward approach.

Upvotes: 2

Views: 1733

Answers (3)

camille

Reputation: 16842

I would think about it like a graph problem. There are 2 changes to make to the original data to fit this approach: switch the order of columns to show the hierarchical direction (parent to child), and add a top-level node (I'm calling it "Items") that links to the major groups (food & not food). You could probably do that second part programmatically but it seems like more of a pain than it's worth.

library(dplyr)

df <- tibble::tribble(
  ~Child,     ~Parent,
  "Fruit",      "Food",
  "Vegetable",      "Food",
  "Apple",     "Fruit",
  "Banana",     "Fruit",
  "Pear",     "Fruit",
  "Carrot", "Vegetable",
  "Celery", "Vegetable",
  "Bike",  "Not Food",
  "Car",  "Not Food"
) %>%
  select(Parent, Child) %>%
  add_row(Parent = "Items", Child = c("Food", "Not Food"))
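If you do want to add the top-level node programmatically, one possible sketch is below. It assumes that a root is any parent that never appears as a child. The names `edges`, `roots` and `edges_rooted` are made up for illustration, and base R is used only to keep the example self-contained.

```r
# Edge list in parent-to-child order (same data as above, without "Items")
edges <- data.frame(
  Parent = c("Food", "Food", "Fruit", "Fruit", "Fruit",
             "Vegetable", "Vegetable", "Not Food", "Not Food"),
  Child  = c("Fruit", "Vegetable", "Apple", "Banana", "Pear",
             "Carrot", "Celery", "Bike", "Car"),
  stringsAsFactors = FALSE
)

# Assumption: roots are parents that never show up in the Child column
roots <- setdiff(edges$Parent, edges$Child)

# Link an artificial top-level node ("Items") to every root
edges_rooted <- rbind(
  data.frame(Parent = "Items", Child = roots, stringsAsFactors = FALSE),
  edges
)
```

With this data, `roots` contains "Food" and "Not Food", so `edges_rooted` should carry the same edges as the `df` built with `add_row()` above, only in a different row order.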

The first method is with data.tree, which is designed to work with this type of data. It creates a tree representation, which you can then convert back to a data frame with one of a few shapes.

library(data.tree)

g1 <- FromDataFrameNetwork(df)
g1
#>             levelName
#> 1  Items             
#> 2   ¦--Food          
#> 3   ¦   ¦--Fruit     
#> 4   ¦   ¦   ¦--Apple 
#> 5   ¦   ¦   ¦--Banana
#> 6   ¦   ¦   °--Pear  
#> 7   ¦   °--Vegetable 
#> 8   ¦       ¦--Carrot
#> 9   ¦       °--Celery
#> 10  °--Not Food      
#> 11      ¦--Bike      
#> 12      °--Car
ToDataFrameTypeCol(g1)
#>   level_1  level_2   level_3 level_4
#> 1   Items     Food     Fruit   Apple
#> 2   Items     Food     Fruit  Banana
#> 3   Items     Food     Fruit    Pear
#> 4   Items     Food Vegetable  Carrot
#> 5   Items     Food Vegetable  Celery
#> 6   Items Not Food      Bike    <NA>
#> 7   Items Not Food       Car    <NA>

The second method is more convoluted and probably only makes sense if there are other graph operations you need to do. Make a graph with igraph, then get all the paths in the graph starting from the top node Items. That gives you a list of vertex objects; for each of those, extract IDs. One example of those is below.

library(igraph)
g2 <- graph_from_data_frame(df)
all_simple_paths(g2, from = "Items") %>%
  purrr::map(as_ids) %>%
  `[[`(4)
#> [1] "Items"  "Food"   "Fruit"  "Banana"

Create data frames from all those vectors, bind, and reshape to get one column per level.

all_simple_paths(g2, from = "Items") %>%
  purrr::map(as_ids) %>%
  purrr::map_dfr(tibble::enframe, .id = "row") %>%
  tidyr::pivot_wider(id_cols = row, names_prefix = "level_")
#> # A tibble: 11 × 5
#>    row   level_1 level_2  level_3   level_4
#>    <chr> <chr>   <chr>    <chr>     <chr>  
#>  1 1     Items   Food     <NA>      <NA>   
#>  2 2     Items   Food     Fruit     <NA>   
#>  3 3     Items   Food     Fruit     Apple  
#>  4 4     Items   Food     Fruit     Banana 
#>  5 5     Items   Food     Fruit     Pear   
#>  6 6     Items   Food     Vegetable <NA>   
#>  7 7     Items   Food     Vegetable Carrot 
#>  8 8     Items   Food     Vegetable Celery 
#>  9 9     Items   Not Food <NA>      <NA>   
#> 10 10    Items   Not Food Bike      <NA>   
#> 11 11    Items   Not Food Car       <NA>
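Unlike the data.tree output, this table also includes the intermediate paths (for example the path that stops at Food). If only root-to-leaf paths are wanted, one option is to restrict the `to` argument of `all_simple_paths()` to the leaf vertices. This is a sketch that assumes a leaf is a vertex with no outgoing edges; `edges`, `leaves`, `leaf_paths` and `leaf_table` are illustrative names.

```r
library(igraph)

# Parent-to-child edge list, including the artificial "Items" root
edges <- data.frame(
  Parent = c("Items", "Items", "Food", "Food", "Fruit", "Fruit", "Fruit",
             "Vegetable", "Vegetable", "Not Food", "Not Food"),
  Child  = c("Food", "Not Food", "Fruit", "Vegetable", "Apple", "Banana", "Pear",
             "Carrot", "Celery", "Bike", "Car"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges)

# Assumption: leaves are vertices with out-degree zero
leaves <- V(g)[degree(g, mode = "out") == 0]

# Only paths that end at a leaf; convert each path to a character vector
leaf_paths <- lapply(all_simple_paths(g, from = "Items", to = leaves), as_ids)

# Pad shorter paths with NA so every row has one entry per level
depth <- max(lengths(leaf_paths))
padded <- vapply(
  leaf_paths,
  function(p) c(p, rep(NA_character_, depth - length(p))),
  character(depth)
)
leaf_table <- as.data.frame(t(padded), stringsAsFactors = FALSE)
names(leaf_table) <- paste0("level_", seq_len(depth))
```

For this data, `leaf_table` should have one row per leaf (seven rows), with `NA` in `level_4` for Bike and Car.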

In either case, drop the level 1 column if you don't actually want it.
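As a rough sketch of that last step, the snippet below drops the root column and renames the rest to the `Level.01`, `Level.02`, ... layout from the question. The data frame `wide` is typed in by hand as a stand-in for the `ToDataFrameTypeCol(g1)` output shown earlier, and `flat` is an illustrative name.

```r
# Hand-written stand-in for the ToDataFrameTypeCol(g1) output
wide <- data.frame(
  level_1 = rep("Items", 7),
  level_2 = c(rep("Food", 5), rep("Not Food", 2)),
  level_3 = c("Fruit", "Fruit", "Fruit", "Vegetable", "Vegetable", "Bike", "Car"),
  level_4 = c("Apple", "Banana", "Pear", "Carrot", "Celery", NA, NA),
  stringsAsFactors = FALSE
)

# Drop the artificial root column
flat <- wide[, setdiff(names(wide), "level_1")]

# Renumber the remaining columns from the top of the hierarchy down
names(flat) <- sprintf("Level.%02d", seq_along(flat))

# Put the deepest level first, as in the question's desired output
flat <- flat[, rev(names(flat))]
```

The Bike and Car rows keep their `NA` in `Level.03`, which matches the shape the question asks for.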

Upvotes: 3

Onyambu

Reputation: 79238

Here is a function that uses a while loop:

fun <- function(s){
  i <- 1
  while(i<=length(s)){
    if(any(s[[i]] %in% names(s)))
    {
      nms <- s[[i]]
      s[[i]] <- stack(s[nms])
      s[nms] <- NULL
    }
    else
      s[[i]] <- data.frame(values = NA, ind = s[[i]])
    i <- i+1
  }
  s
}

dplyr::bind_rows(fun(unstack(df)), .id = 'Level.01')[c(2:3,1)]
  values       ind Level.01
1  Apple     Fruit     Food
2 Banana     Fruit     Food
3   Pear     Fruit     Food
4 Carrot Vegetable     Food
5 Celery Vegetable     Food
6   <NA>      Bike Not Food
7   <NA>       Car Not Food

You could generalize this approach if you had more levels.
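One way such a generalization might look is sketched below. This is not the function above but a different technique: it walks from each leaf up to its root with `match()`, so it does not depend on the number of levels. It assumes every child has exactly one parent and that the data contains no cycles; `path_to_root`, `paths` and `flat` are illustrative names.

```r
df <- data.frame(
  Child  = c("Fruit", "Vegetable", "Apple", "Banana", "Pear",
             "Carrot", "Celery", "Bike", "Car"),
  Parent = c("Food", "Food", "Fruit", "Fruit", "Fruit",
             "Vegetable", "Vegetable", "Not Food", "Not Food"),
  stringsAsFactors = FALSE
)

# Leaves are children that are never a parent themselves
leaves <- setdiff(df$Child, df$Parent)

# Follow Child -> Parent links until a node has no parent (assumes no cycles)
path_to_root <- function(node) {
  path <- node
  repeat {
    parent <- df$Parent[match(node, df$Child)]
    if (is.na(parent)) break
    path <- c(path, parent)
    node <- parent
  }
  rev(path)  # root first, leaf last
}

paths <- lapply(leaves, path_to_root)
depth <- max(lengths(paths))

# Pad shallow branches with NA at the deepest level
padded <- vapply(
  paths,
  function(p) c(p, rep(NA_character_, depth - length(p))),
  character(depth)
)
flat <- as.data.frame(t(padded), stringsAsFactors = FALSE)
names(flat) <- sprintf("Level.%02d", seq_len(depth))
```

Here `Level.01` holds the root of each branch, and branches that are shallower than the deepest one get `NA` in the trailing columns.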

Upvotes: 4

Martin Gal

Reputation: 16988

In this special case you could get your desired result by doing some joining and binding:

library(dplyr)

df2 <- df %>% 
  inner_join(df, 
             by = c("Parent" = "Child"),
             suffix = c("", "_top")) 

df %>% 
  anti_join(df2, by = c("Child", "Parent")) %>%
  select(Parent_top = Parent, Parent = Child) %>% 
  bind_rows(df2) %>%
  group_by(Parent_top, Parent) %>% 
  filter(!is.na(Child) | n() == 1) %>% 
  select(Level_01 = Parent_top, Level_02 = Parent, Level_03 = Child)

However, I don't think this approach is very robust for larger or differently structured datasets. A loop over the dataset would probably give you a more general answer.

Upvotes: 2
