Reputation: 41

Creating nested data frames with variable dimensions

I have a data frame in which one column keys describes the format of all remaining columns. In the example below there are 2 such value-columns, but in general there may be many more.

library(tidyverse)

dat = tribble(
  ~id, ~keys,    ~vals1,   ~vals2,
  1,    "A/B",   "1/2",   "11/12",
  3,    "C/D/E", "6/7/8", "16"
)

I would like to transform these columns into a single column of nested data frames: in each row the values should be split on "/" and form the rows of a data frame, with headers taken from the keys entry.

Entries in the value columns may be truncated, in which case NA's should be used for the missing values (i.e., the entry "16" in the example should be interpreted as "16/NA/NA".)

The following code produces the wanted column for this particular case:

res = dat %>%
  mutate_at(vars(keys:last_col()), str_split, pattern = fixed("/")) %>%
  mutate(df = pmap(select(., keys:last_col()),
                   ~ bind_rows(setNames(..2, ..1[1:length(..2)]),
                               setNames(..3, ..1[1:length(..3)]))))
res$df
#> [[1]]
#> # A tibble: 2 x 2
#>   A     B    
#>   <chr> <chr>
#> 1 1     2    
#> 2 11    12   
#> 
#> [[2]]
#> # A tibble: 2 x 3
#>   C     D     E    
#>   <chr> <chr> <chr>
#> 1 6     7     8    
#> 2 16    <NA>  <NA>

My question is how to generalise to larger (and unknown) numbers of columns. Also, my use of setNames feels rather clumsy, and I was hoping for something a bit more elegant.

I am primarily looking for a tidyverse solution, but other approaches are welcome.

Update

I should have emphasised that the output I'm looking for is a single data frame, with columns id (unchanged) and df (a list of nested data frames).

(The original keys/values columns are not important; they may be removed.)

Here is the wanted structure in the above example:

str(res %>% select(id, df))
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
#>  $ id: num  1 3
#>  $ df:List of 2
#>   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  2 variables:
#>   .. ..$ A: chr  "1" "11"
#>   .. ..$ B: chr  "2" "12"
#>   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  3 variables:
#>   .. ..$ C: chr  "6" "16"
#>   .. ..$ D: chr  "7" NA
#>   .. ..$ E: chr  "8" NA

Upvotes: 2

Answers (3)

Magnus

Reputation: 41

Here is an improvement of my own original attempt, which at least works for any number of columns.

After defining a small utility function,

set_names_pad = function(x, y) {
  length(x) = length(y)
  setNames(x, y)
}

the following pmap-based code gives the wanted result:

dat %>%
  mutate_at(vars(keys:last_col()), str_split, pattern = fixed("/"))  %>%
  mutate_at(vars(matches("val")), ~ map2(., keys, set_names_pad)) %>%
  mutate(df = pmap(select(., matches("val")), bind_rows))
#> # A tibble: 2 x 5
#>      id keys      vals1     vals2     df              
#>   <dbl> <list>    <list>    <list>    <list>          
#> 1     1 <chr [2]> <chr [2]> <chr [2]> <tibble [2 x 2]>
#> 2     3 <chr [3]> <chr [3]> <chr [3]> <tibble [2 x 3]>

This seems to perform reasonably well when the input has very many rows. Here's a comparison against two of @IceCreamToucan's suggestions:

# pmap solution
g = function(x) {
  x %>%
    mutate_at(vars(keys:last_col()), str_split, pattern = fixed("/"))  %>%
    mutate_at(vars(matches("val")), ~ map2(., keys, set_names_pad)) %>%
    mutate(df = pmap(select(., matches("val")), bind_rows))
}

# IceCreamToucan I
f1 = function(x) { 
  x %>% 
  mutate(df =  apply(.[-1], 1, function(x)
    data.table::fread(paste(x, collapse = '\n'), sep = '/', fill = TRUE)))
}

# IceCreamToucan II
f2 = function(x) {
  x %>%
    mutate(df = lapply(do.call(paste, c(.[-1], sep = '\n')),
                       data.table::fread, sep = '/', fill = TRUE))
}


bench::mark(f1(dat), f2(dat), g(dat), check = F)
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 f1(dat)      1.87ms   1.94ms     483.     1.93MB     9.38
#> 2 f2(dat)      1.59ms   1.66ms     573.    34.79KB    11.0 
#> 3 g(dat)       9.26ms   9.56ms      98.2   15.13KB    12.3

# Increase to 10,000 rows
dat2 = list(dat) %>% rep(5000) %>% bind_rows %>% mutate(id = row_number())

bench::mark(f1(dat2), f2(dat2), g(dat2), check = F)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 f1(dat2)      5.58s    5.58s     0.179     164MB     2.87
#> 2 f2(dat2)      4.88s    4.88s     0.205     163MB     3.07
#> 3 g(dat2)    407.51ms 422.89ms     2.36      484KB     5.91

# Increase to 50,000 rows
dat3 = list(dat) %>% rep(25000) %>% bind_rows %>% mutate(id = row_number())

bench::mark(f1(dat3), f2(dat3), g(dat3), check = F)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 f1(dat3)     30.56s   30.56s    0.0327   825.7MB     1.64
#> 2 f2(dat3)     26.84s   26.84s    0.0373   816.7MB     1.49
#> 3 g(dat3)       3.63s    3.63s    0.275      2.3MB     2.20

I still have a feeling this operation could be done more elegantly using the pivoting functions of tidyr, though.

Upvotes: 2

IceCreamToucan

Reputation: 28695

For each row, you can convert the last 3 columns into a single character element where the column values are separated by newline characters. Then you essentially have a csv but with /s instead of commas, so you can then use a read.table or something to read it. I used data.table::fread because of its fill option, but there may be a way to do this with read_table or read.table as well.

res <- 
  dat %>% 
    mutate(df =  apply(dat[-1], 1, function(x)
                    data.table::fread(paste(x, collapse = '\n'), 
                                      sep = '/', fill = TRUE)))

res$df

# [[1]]
#     A  B
# 1:  1  2
# 2: 11 12
# 
# [[2]]
#     C  D  E
# 1:  6  7  8
# 2: 16 NA NA

Here's another option. Same output and same idea mostly, but apply is not used so a temporary (potentially large) matrix isn't created. The code is a little less clear though.

res <- 
  dat %>% 
    mutate(df =  lapply(do.call(paste, c(dat[-1], sep = '\n')),
                        data.table::fread, sep = '/', fill = TRUE))


res$df
# [[1]]
#     A  B
# 1:  1  2
# 2: 11 12
# 
# [[2]]
#     C  D  E
# 1:  6  7  8
# 2: 16 NA NA

You can also use split as below

split(dat[-1], dat[1]) %>% 
  map(~ fread(paste0(.x, collapse="\n"), sep="/", fill = TRUE))

# $`1`
#     A  B
# 1:  1  2
# 2: 11 12
# 
# $`3`
#     C  D  E
# 1:  6  7  8
# 2: 16 NA NA

Upvotes: 3

akrun

Reputation: 887213

Here is another option after reshaping

library(dplyr)
library(tidyr)
library(purrr)
dat %>% 
  pivot_longer(matches("vals\\d+")) %>% 
  select(-id) %>% 
  pivot_wider(names_from = keys, values_from = value) %>% 
  select(-name) %>%
  split.default(seq_along(.)) %>%
  map(~ .x %>% 
           separate(names(.), into = str_split(names(.), fixed("/")) %>% 
                unlist, sep="[/]"))

Upvotes: 3

Creating nested data frames with variable dimensions

Update

Answers (3)

Related Questions