Reputation: 41
I have a data frame in which one column keys
describes the format of all remaining columns. In the example below there are 2 such value-columns, but in general there may be many more.
library(tidyverse)
dat = tribble(
~id, ~keys, ~vals1, ~vals2,
1, "A/B", "1/2", "11/12",
3, "C/D/E", "6/7/8", "16"
)
I would like to transform these columns into a single column of nested data frames: in each row the values should be split on "/"
and form the rows of a data frame, with headers taken from the keys
entry.
Entries in the value columns may be truncated, in which case NA's should be used for the missing values (i.e., the entry "16"
in the example should be interpreted as "16/NA/NA"
.)
The following code produces the wanted column for this particular case:
res = dat %>%
mutate_at(vars(keys:last_col()), str_split, pattern = fixed("/")) %>%
mutate(df = pmap(select(., keys:last_col()),
~ bind_rows(setNames(..2, ..1[1:length(..2)]),
setNames(..3, ..1[1:length(..3)]))))
res$df
#> [[1]]
#> # A tibble: 2 x 2
#> A B
#> <chr> <chr>
#> 1 1 2
#> 2 11 12
#>
#> [[2]]
#> # A tibble: 2 x 3
#> C D E
#> <chr> <chr> <chr>
#> 1 6 7 8
#> 2 16 <NA> <NA>
My question is how to generalise to larger (and unknown) numbers of columns. Also, my use of setNames
feels rather clumsy, and I was hoping for something a bit more elegant.
I am primarily looking for a tidyverse solution, but other approaches are welcome.
I should have emphasised that the output I'm looking for is a single data frame, with columns id
(unchanged) and df
(a list of nested data frames).
(The original keys/values columns are not important; they may be removed.)
Here is the wanted structure in the above example:
str(res %>% select(id, df))
#> Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 2 variables:
#> $ id: num 1 3
#> $ df:List of 2
#> ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 2 variables:
#> .. ..$ A: chr "1" "11"
#> .. ..$ B: chr "2" "12"
#> ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 3 variables:
#> .. ..$ C: chr "6" "16"
#> .. ..$ D: chr "7" NA
#> .. ..$ E: chr "8" NA
Upvotes: 2
Views: 775
Reputation: 41
Here is an improvement of my own original attempt, which at least works for any number of columns.
After defining a small utility function,
set_names_pad = function(x, y) {
length(x) = length(y)
setNames(x, y)
}
the following pmap
-based code gives the wanted result:
dat %>%
mutate_at(vars(keys:last_col()), str_split, pattern = fixed("/")) %>%
mutate_at(vars(matches("val")), ~ map2(., keys, set_names_pad)) %>%
mutate(df = pmap(select(., matches("val")), bind_rows))
#> # A tibble: 2 x 5
#> id keys vals1 vals2 df
#> <dbl> <list> <list> <list> <list>
#> 1 1 <chr [2]> <chr [2]> <chr [2]> <tibble [2 x 2]>
#> 2 3 <chr [3]> <chr [3]> <chr [3]> <tibble [2 x 3]>
This seems to perform reasonably well when the input has very many rows. Here's a comparison against two of @IceCreamToucan's suggestions:
# pmap solution
g = function(x) {
x %>%
mutate_at(vars(keys:last_col()), str_split, pattern = fixed("/")) %>%
mutate_at(vars(matches("val")), ~ map2(., keys, set_names_pad)) %>%
mutate(df = pmap(select(., matches("val")), bind_rows))
}
# IceCreamToucan I
f1 = function(x) {
x %>%
mutate(df = apply(.[-1], 1, function(x)
data.table::fread(paste(x, collapse = '\n'), sep = '/', fill = TRUE)))
}
# IceCreamToucan II
f2 = function(x) {
x %>%
mutate(df = lapply(do.call(paste, c(.[-1], sep = '\n')),
data.table::fread, sep = '/', fill = TRUE))
}
bench::mark(f1(dat), f2(dat), g(dat), check = F)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 f1(dat) 1.87ms 1.94ms 483. 1.93MB 9.38
#> 2 f2(dat) 1.59ms 1.66ms 573. 34.79KB 11.0
#> 3 g(dat) 9.26ms 9.56ms 98.2 15.13KB 12.3
# Increase to 10,000 rows
dat2 = list(dat) %>% rep(5000) %>% bind_rows %>% mutate(id = row_number())
bench::mark(f1(dat2), f2(dat2), g(dat2), check = F)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 f1(dat2) 5.58s 5.58s 0.179 164MB 2.87
#> 2 f2(dat2) 4.88s 4.88s 0.205 163MB 3.07
#> 3 g(dat2) 407.51ms 422.89ms 2.36 484KB 5.91
# Increase to 50,000 rows
dat3 = list(dat) %>% rep(25000) %>% bind_rows %>% mutate(id = row_number())
bench::mark(f1(dat3), f2(dat3), g(dat3), check = F)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 f1(dat3) 30.56s 30.56s 0.0327 825.7MB 1.64
#> 2 f2(dat3) 26.84s 26.84s 0.0373 816.7MB 1.49
#> 3 g(dat3) 3.63s 3.63s 0.275 2.3MB 2.20
I still have a feeling this operation could be done more elegantly using the pivoting functions of tidyr
, though.
Upvotes: 2
Reputation: 28695
For each row, you can convert the last 3 columns into a single character element where the column values are separated by newline characters. Then you essentially have a csv but with /
s instead of commas, so you can then use a read.table or something to read it. I used data.table::fread because of its fill
option, but there may be a way to do this with read_table or read.table as well.
res <-
dat %>%
mutate(df = apply(dat[-1], 1, function(x)
data.table::fread(paste(x, collapse = '\n'),
sep = '/', fill = TRUE)))
res$df
# [[1]]
# A B
# 1: 1 2
# 2: 11 12
#
# [[2]]
# C D E
# 1: 6 7 8
# 2: 16 NA NA
Here's another option. Same output and same idea mostly, but apply
is not used so a temporary (potentially large) matrix isn't created. The code is a little less clear though.
res <-
dat %>%
mutate(df = lapply(do.call(paste, c(dat[-1], sep = '\n')),
data.table::fread, sep = '/', fill = TRUE))
res$df
# [[1]]
# A B
# 1: 1 2
# 2: 11 12
#
# [[2]]
# C D E
# 1: 6 7 8
# 2: 16 NA NA
You can also use split
as below
split(dat[-1], dat[1]) %>%
map(~ fread(paste0(.x, collapse="\n"), sep="/", fill = TRUE))
# $`1`
# A B
# 1: 1 2
# 2: 11 12
#
# $`3`
# C D E
# 1: 6 7 8
# 2: 16 NA NA
Upvotes: 3
Reputation: 887213
Here is another option after reshaping
library(dplyr)
library(tidyr)
library(purrr)
dat %>%
pivot_longer(matches("vals\\d+")) %>%
select(-id) %>%
pivot_wider(names_from = keys, values_from = value) %>%
select(-name) %>%
split.default(seq_along(.)) %>%
map(~ .x %>%
separate(names(.), into = str_split(names(.), fixed("/")) %>%
unlist, sep="[/]"))
Upvotes: 3