PGSA
PGSA

Reputation: 3081

Create a list column of unique values from other columns in a grouped tibble

I have a tibble in which I have, among others, some character columns and a single list column containing additional character in only some rows, otherwise blank. Each time this analysis is used, the number of character columns may change, but the naming remains consistent - they always start C (e.g. C1L, C1H etc).

Example dataset:

exemplar = tibble(group = c(rep("group1", 5), rep("group2",5)),
                  char1 = letters[1:10],
                  char2 = letters[2:11],
                  char3 = letters[3:12],
                  extra = list("",
                               "",
                               "",
                               "",
                               "",
                               list("x","y"),
                               "",
                               "",
                               "",
                               "")
                  )

But in 'live' use there could be 2, 4, 6, however many "char#" columns.

I'm trying to add a list column containing a list or character vector of all unique values in the char+extra columns. I can do it rowwise, but I need to do it for each group. Example desired output:

# A tibble: 10 x 6
   group  char1 char2 char3 extra      unique   
   <chr>  <chr> <chr> <chr> <list>     <list>   
 1 group1 a     b     c     <chr [1]>  <chr [7]>
 2 group1 b     c     d     <chr [1]>  <chr [7]>
 3 group1 c     d     e     <chr [1]>  <chr [7]>
 4 group1 d     e     f     <chr [1]>  <chr [7]>
 5 group1 e     f     g     <chr [1]>  <chr [7]>
 6 group2 f     g     h     <list [2]> <chr [9]>
 7 group2 g     h     i     <chr [1]>  <chr [9]>
 8 group2 h     i     j     <chr [1]>  <chr [9]>
 9 group2 i     j     k     <chr [1]>  <chr [9]>
10 group2 j     k     l     <chr [1]>  <chr [9]>

each 'group 1' row has c("a", "b", "c", "d", "e", "f", "g") in the unique column. each 'group 2' row has c("f", "g", "h", "i", "j", "k", "l", "x", "y") in the unique column.

Things I have tried:

> exemplar %>% group_by(group) %>% mutate(unique = unique(select(., starts_with("c"), "extra")))

Adding missing grouping variables: `group`
Error: Problem with `mutate()` input `unique`.
x Input `unique` can/'t be recycled to size 5.

> exemplar %>% 
+   mutate(unique = pmap_chr(
+     .l = select(., starts_with("c"), "extra"),
+     .f = function(...) unique(...)
+   )
+ )

Error: Problem with `mutate()` input `unique`.
x argument "x" is missing, with no default


> exemplar %>% 
+   mutate(unique = pmap_chr(
+     .l = select(., starts_with("c"), "extra"),
+     .f = function(...) unique(...)
+   )
+ )
 Error: Problem with `mutate()` input `unique`.
x argument "x" is missing, with no default

I feel like I've completely missed a simple way to do this - should I be unlisting the 'extra' column earlier? Given the variable number of columns to include, should I be creating a selection vector to use instead of using select(., ...) to do it in-place? Do I need to do it in a few steps then discard the unneeded columns?

Upvotes: 2

Views: 1422

Answers (1)

Vincent Guillemot
Vincent Guillemot

Reputation: 3429

I think the following code might do what you ask. The trick is to combine the values from different columns, some of them containing characters, and the other one containing "lists". So the first steps are to extract the information from both types of columns as vectors (with c_across(starts_with("c")) and unlist(extra)) ads combine them into a vector on which you will be able to work.

exemplar %>%
  group_by(group) %>%
  mutate(unique = list( # Makes sure that the new column is a "list"
    unique(             # Get the "unique" values
      c(                # Combine results from two types of columns
        c_across(       # First extract the "char" columns into a vector 
          starts_with("c")),
        unlist(extra))  # Then extract the "extra" column into a vector
    )
  )) %>%
  ungroup()

The result of this command is the following

# A tibble: 10 × 6
   group  char1 char2 char3 extra      unique    
   <chr>  <chr> <chr> <chr> <list>     <list>    
 1 group1 a     b     c     <chr [1]>  <chr [8]> 
 2 group1 b     c     d     <chr [1]>  <chr [8]> 
 3 group1 c     d     e     <chr [1]>  <chr [8]> 
 4 group1 d     e     f     <chr [1]>  <chr [8]> 
 5 group1 e     f     g     <chr [1]>  <chr [8]> 
 6 group2 f     g     h     <list [2]> <chr [10]>
 7 group2 g     h     i     <chr [1]>  <chr [10]>
 8 group2 h     i     j     <chr [1]>  <chr [10]>
 9 group2 i     j     k     <chr [1]>  <chr [10]>
10 group2 j     k     l     <chr [1]>  <chr [10]>

For group 1, the result is

[[1]]
[1] "a" "b" "c" "d" "e" "f" "g" "" 

And for group 1,

[[1]]
 [1] "f" "g" "h" "i" "j" "k" "l" "x" "y" "" 

Upvotes: 1

Related Questions