KidLu

Reputation: 223

Concatenate values across multiple rows for various IDs in R

My question is highly related to the following thread: concatenate values across two rows in R

The main difference is that I would like to concatenate only those rows that share the same ID. So I need some kind of grouping, but I wasn't able to get it to work.

# desired input
input <- data.frame(ID = c(1,1,1,3,3,3),
                   X1 = c("A", 1, 11, "D", 4, 44),
                   X2 = c("B", 2, 22, "E", 5, 55),
                   X3 = c("C", 3, 33, "F", 6, 66))

# desired output
output <- data.frame(ID = c(1,3),
                     X1 = c("A-1-11", "D-4-44"),
                     X2 = c("B-2-22", "E-5-55"),
                     X3 = c("C-3-33", "F-6-66"))

I tried the solution from the mentioned thread, but this concatenates all six rows:

output_v1 <- data.table::rbindlist(list(input, data.table::setDT(input)[, lapply(.SD, paste, collapse='-')]))

This cannot work, since nothing in the call groups by ID, but I could not find a way to group in the documentation. Can anyone point me in the right direction?

Thanks a lot!

The question above was answered perfectly; however, I noticed a second layer of complexity in my data:

# desired input
input2 <- data.frame(ID = c(1,1,1,3,3,3),
                    X1 = c("A", 1, 11, "D", 4, 44),
                    X2 = c("B", 2, 22, "E", 5, 55),
                    X3 = c("C", 3, 33, "F", 6, 66),
                    X4 = c("G", "G", "G", "H", 8, 88),
                    X5 = c("I", "I", "I", "J", "J", "J"),
                    X6 = c("K", "K", "0", "L", "L", "L"))

# desired output
output2 <- data.frame(ID = c(1,3),
                     X1 = c("A-1-11", "D-4-44"),
                     X2 = c("B-2-22", "E-5-55"),
                     X3 = c("C-3-33", "F-6-66"),
                     X4 = c("G", "H-8-88"),
                     X5 = c("I", "J"),
                     X6 = c("K-K-0", "L"))

Sometimes a column is completely identical within one ID. In this case I do not want to concatenate the same value multiple times, but rather keep it once.

I tried the following to identify columns with differences within one ID - those columns I'd like to concatenate:

library(dplyr)
library(tidyr)
library(stringr)  # str_flatten() is used further below

changes <- input2 |>
  group_by(ID) |> 
  mutate(across(everything(), ~n_distinct(.x) > 1)) |> 
  pivot_longer(-ID, names_to = "col", values_to = "changed") |> 
  filter(changed) |> 
  select(-changed) |> 
  distinct()

Then I can treat the two cases differently:

data_concat <- input2 |>
  as_tibble() |>
  group_by(ID) |>
  select(changes$col) |>
  summarise(across(everything(), list(function(col) str_flatten(col, ", "))))

data_unique <- input2 |> 
  dplyr::select(!all_of(changes$col)) |>
  dplyr::distinct() 

data_new <- data_unique |>
  left_join(data_concat, by = 'ID')

However, this only works for column X5, where every entry is identical within each ID. I have not yet figured out how to treat X4 and X6 correctly. Any suggestions?

Additional information: if all values within one column and one ID are identical, they should collapse to a single value. Otherwise all values should be concatenated, duplicates included. So: KKKKK -> "K", KKKK0 -> "K-K-K-K-0", 5MMM5 -> "5-M-M-M-5", GGG99 -> "G-G-G-9-9", etc.
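To make the rule concrete, here is a minimal base R sketch of it as a single function. The name collapse_unless_constant and its sep argument are hypothetical, chosen only for illustration:

```r
# Hypothetical helper (name not from any package): keep one copy when every
# entry is identical, otherwise concatenate all entries, duplicates included.
collapse_unless_constant <- function(x, sep = "-") {
  x <- as.character(x)
  if (length(unique(x)) == 1L) x[1L] else paste(x, collapse = sep)
}

collapse_unless_constant(c("K", "K", "K", "K", "K"))  # "K"
collapse_unless_constant(c("K", "K", "K", "K", "0"))  # "K-K-K-K-0"
collapse_unless_constant(c("5", "M", "M", "M", "5"))  # "5-M-M-M-5"
collapse_unless_constant(c("G", "G", "G", "9", "9"))  # "G-G-G-9-9"
```

A function like this would then need to be applied to every non-ID column within each ID group.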

P.S.: I can create an additional question if it is not considered good style to enlarge the scope of a question. If that's the case, please comment. The first part was perfectly answered already.

Upvotes: 2

Views: 746

Answers (3)

Jan Z

Reputation: 171

With tidyverse:

library(tidyverse)
input %>% as_tibble() %>% group_by(ID) %>% summarise(across(everything(), list(function(col) str_flatten(col, '-'))))

returns:

# A tibble: 2 × 4
     ID X1_1   X2_1   X3_1  
  <dbl> <chr>  <chr>  <chr> 
1     1 A-1-11 B-2-22 C-3-33
2     3 D-4-44 E-5-55 F-6-66
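The _1 suffix in the column names above appears because the function is wrapped in an unnamed list, so across() appends the function's position to each column name. A sketch of a variant that passes the function directly and should therefore keep the original names (it loads only dplyr and stringr rather than the whole tidyverse):

```r
library(dplyr)
library(stringr)

input <- data.frame(ID = c(1, 1, 1, 3, 3, 3),
                    X1 = c("A", 1, 11, "D", 4, 44),
                    X2 = c("B", 2, 22, "E", 5, 55),
                    X3 = c("C", 3, 33, "F", 6, 66))

# Passing a single function (not a list) to across() keeps the column names.
result <- input %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ str_flatten(.x, "-")))

names(result)  # "ID" "X1" "X2" "X3"
```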

Edit for output 2:

input2 %>% as_tibble() %>% group_by(ID) %>% 
    summarise(across(everything(), ~if_else(length(unique(.))==1, str_flatten(unique(.), '-'), str_flatten(., '-'))))

returns:

# A tibble: 2 × 7
     ID X1     X2     X3     X4     X5    X6   
  <dbl> <chr>  <chr>  <chr>  <chr>  <chr> <chr>
1     1 A-1-11 B-2-22 C-3-33 G      I     K-K-0
2     3 D-4-44 E-5-55 F-6-66 H-8-88 J     L   

Upvotes: 2

akrun

Reputation: 886948

Or with data.table:

library(data.table)
setDT(input)[, lapply(.SD, paste, collapse='-'), by = ID]
   ID     X1     X2     X3
1:  1 A-1-11 B-2-22 C-3-33
2:  3 D-4-44 E-5-55 F-6-66
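The same grouped call could be extended to the second part of the question (input2). The following is only a sketch, not part of the original answer: it uses data.table's uniqueN() to detect columns that are constant within an ID, and as.data.table() so that input2 itself is not modified by reference.

```r
library(data.table)

input2 <- data.frame(ID = c(1, 1, 1, 3, 3, 3),
                     X1 = c("A", 1, 11, "D", 4, 44),
                     X2 = c("B", 2, 22, "E", 5, 55),
                     X3 = c("C", 3, 33, "F", 6, 66),
                     X4 = c("G", "G", "G", "H", 8, 88),
                     X5 = c("I", "I", "I", "J", "J", "J"),
                     X6 = c("K", "K", "0", "L", "L", "L"))

# Per ID and column: keep one copy if all values are identical,
# otherwise paste every value together, duplicates included.
result <- as.data.table(input2)[
  , lapply(.SD, function(x) {
      if (uniqueN(x) == 1L) as.character(x[1L]) else paste(x, collapse = "-")
    }),
  by = ID
]

result
```

With the sample data, X4 should come out as "G" and "H-8-88", X5 as "I" and "J", and X6 as "K-K-0" and "L", matching output2.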

Upvotes: 2

Jilber Urbina

Reputation: 61154

We can use dplyr functions:

library(dplyr)
input %>% 
  group_by(ID) %>% 
  mutate(across(everything(), ~paste0(.,collapse = "-"))) %>% 
  slice(1)
# A tibble: 2 × 4
# Groups:   ID [2]
     ID X1     X2     X3    
  <dbl> <chr>  <chr>  <chr> 
1     1 A-1-11 B-2-22 C-3-33
2     3 D-4-44 E-5-55 F-6-66

Upvotes: 4
