Atakan
Atakan

Reputation: 436

Join data frames without creating duplicate rows while concatenating unique entries under one column

I'm trying to merge two data frames together which are related to each other via a specific variable named patient. The second data frame has multiple entries for the same patient column. I don't want to create duplicate patient entries upon merging, but I want to retain unique information in the second data frame by concatenating the values under one column.

I tried manually concatenating certain variables using group_by which works. I have several variables, however, and manually specifying all of them is not feasible

I can also concatenate every variable in the data frame by using dplyr as seen below. The problem in the second case is that duplicate values are also concatenated making the data frame unnecessarily big and difficult to deal with. Please see the reprex below.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df1 <- data.frame(patient=c("a", "b", "c"), 
                  var1 = 1:3, 
                  var2=11:13)

df1
#>   patient var1 var2
#> 1       a    1   11
#> 2       b    2   12
#> 3       c    3   13



df2 <- data.frame(patient=c("a","a", "b", "b", "c", "c" ), 
                  treatment=  rep(c("drug1", "drug2"), 3), 
                  time= rep(c("time1", "time2"), 3),
                  var3= "constant")

df2
#>   patient treatment  time     var3
#> 1       a     drug1 time1 constant
#> 2       a     drug2 time2 constant
#> 3       b     drug1 time1 constant
#> 4       b     drug2 time2 constant
#> 5       c     drug1 time1 constant
#> 6       c     drug2 time2 constant


df_merged <- left_join(df1, df2)
#> Joining, by = "patient"

# Don't want duplicates like this
df_merged
#>   patient var1 var2 treatment  time     var3
#> 1       a    1   11     drug1 time1 constant
#> 2       a    1   11     drug2 time2 constant
#> 3       b    2   12     drug1 time1 constant
#> 4       b    2   12     drug2 time2 constant
#> 5       c    3   13     drug1 time1 constant
#> 6       c    3   13     drug2 time2 constant



df_merged2 <- df_merged %>%
  group_by(patient) %>%
  mutate(treatment = paste(treatment, collapse = "_"),
         time=paste(time, collapse = "_")) %>%
  filter(!duplicated(patient))

# I can manually edit a few variables like this
df_merged2
#> # A tibble: 3 x 6
#> # Groups:   patient [3]
#>   patient  var1  var2 treatment   time        var3    
#>   <fct>   <int> <int> <chr>       <chr>       <fct>   
#> 1 a           1    11 drug1_drug2 time1_time2 constant
#> 2 b           2    12 drug1_drug2 time1_time2 constant
#> 3 c           3    13 drug1_drug2 time1_time2 constant


df_merged3 <- df_merged %>%
  group_by(patient) %>%
  mutate_at(vars(-group_cols()), .funs = ~paste(., collapse ="_")) %>%
  filter(!duplicated(patient))


# I have many variables I can't specify manually
# I can create this merged data frame, but I don't want to 
# concatenate duplicated values such as var1, var2, and var3
df_merged3
#> # A tibble: 3 x 6
#> # Groups:   patient [3]
#>   patient var1  var2  treatment   time        var3             
#>   <fct>   <chr> <chr> <chr>       <chr>       <chr>            
#> 1 a       1_1   11_11 drug1_drug2 time1_time2 constant_constant
#> 2 b       2_2   12_12 drug1_drug2 time1_time2 constant_constant
#> 3 c       3_3   13_13 drug1_drug2 time1_time2 constant_constant

Created on 2019-10-23 by the reprex package (v0.3.0)

I'd like to see if there is a way of concatenating variables containing only unique values to retain information from the second data frame without duplicating the rows in the df_merged.

I would be happy to hear if you have recommendations other than dplyr. A data.table solution may also be suitable for me as well, since my real data frames are quite large.

Thanks!

Upvotes: 1

Views: 882

Answers (1)

A. Suliman
A. Suliman

Reputation: 13125

We can use summarise_at and unique

library(dplyr)
df_merged %>%
          group_by(patient) %>%
          summarise_at(vars(-group_cols()), .funs = ~paste(unique(.), collapse ="_"))

Or we can do the merge/joint directly instead of adding/altering the Global Env with an intermediate dataframe.

left_join(df1, 
          df2 %>% group_by(patient) %>% 
                  summarise_at(vars(-group_cols()), .funs = ~paste(unique(.), collapse ="_")) %>% 
                  ungroup()
         )

Joining, by = "patient"
  patient var1 var2   treatment        time     var3
1       a    1   11 drug1_drug2 time1_time2 constant
2       b    2   12 drug1_drug2 time1_time2 constant
3       c    3   13 drug1_drug2 time1_time2 constant

Update

#Here a toy example to experiment with, uncomment browser to see how it works inside Reduce, 
#also see ?Reduce for more info 
paste_mod <- function(x) Reduce(function(u, v){
u <- ifelse(!grepl('_',u) & is.na(u),'.',u)
v <- ifelse(is.na(v),'.',v)
if(v=='.' | !grepl(v,u)) paste0(u,'_',v) else u
}, x)

paste_mod(c("drug1",NA,NA,"drug2","drug1","drug2"))
[1] "drug1_._._drug2"
paste_mod(c(NA,NA,"drug2","drug1","drug2"))
[1] "._._drug2_drug1"

#replace NA with . then apply Reduce
df2 %>% 
       mutate_if(is.factor,as.character) %>% mutate_all(~replace(.,is.na(.),'.')) %>% 
       group_by(patient) %>% 
       summarise_at(vars(-group_cols()), .funs = ~Reduce(function(u, v) if(v=='.' | !grepl(v,u)) paste0(u,'_',v) else u, .)) %>% 
       ungroup()

# A tibble: 2 x 4
  patient treatment       time        var3    
  <chr>   <chr>           <chr>       <chr>   
1 a       drug1_._._drug2 time1_time2 constant
2 c       drug1_drug2     time1_time2 constant

New df2 for testing the updated solution

df2 <- structure(list(patient = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("a", 
"c"), class = "factor"), treatment = structure(c(1L, NA, NA, 
2L, 1L, 2L), .Label = c("drug1", "drug2"), class = "factor"), 
    time = structure(c(1L, 2L, 1L, 2L, 1L, 2L), .Label = c("time1", 
    "time2"), class = "factor"), var3 = structure(c(1L, 1L, 1L, 
    1L, 1L, 1L), class = "factor", .Label = "constant")), class = "data.frame", row.names = c(NA, 
-6L))

Upvotes: 2

Related Questions