Reputation: 436
I'm trying to merge two data frames together which are related to each other via a specific variable named patient
. The second data frame has multiple entries for the same patient column. I don't want to create duplicate patient entries upon merging, but I want to retain unique information in the second data frame by concatenating the values under one column.
I tried manually concatenating certain variables using group_by
which works. I have several variables, however, and manually specifying all of them is not feasible
I can also concatenate every variable in the data frame by using dplyr
as seen below. The problem in the second case is that duplicate values are also concatenated making the data frame unnecessarily big and difficult to deal with. Please see the reprex below.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df1 <- data.frame(patient=c("a", "b", "c"),
var1 = 1:3,
var2=11:13)
df1
#> patient var1 var2
#> 1 a 1 11
#> 2 b 2 12
#> 3 c 3 13
df2 <- data.frame(patient=c("a","a", "b", "b", "c", "c" ),
treatment= rep(c("drug1", "drug2"), 3),
time= rep(c("time1", "time2"), 3),
var3= "constant")
df2
#> patient treatment time var3
#> 1 a drug1 time1 constant
#> 2 a drug2 time2 constant
#> 3 b drug1 time1 constant
#> 4 b drug2 time2 constant
#> 5 c drug1 time1 constant
#> 6 c drug2 time2 constant
df_merged <- left_join(df1, df2)
#> Joining, by = "patient"
# Don't want duplicates like this
df_merged
#> patient var1 var2 treatment time var3
#> 1 a 1 11 drug1 time1 constant
#> 2 a 1 11 drug2 time2 constant
#> 3 b 2 12 drug1 time1 constant
#> 4 b 2 12 drug2 time2 constant
#> 5 c 3 13 drug1 time1 constant
#> 6 c 3 13 drug2 time2 constant
df_merged2 <- df_merged %>%
group_by(patient) %>%
mutate(treatment = paste(treatment, collapse = "_"),
time=paste(time, collapse = "_")) %>%
filter(!duplicated(patient))
# I can manually edit a few variables like this
df_merged2
#> # A tibble: 3 x 6
#> # Groups: patient [3]
#> patient var1 var2 treatment time var3
#> <fct> <int> <int> <chr> <chr> <fct>
#> 1 a 1 11 drug1_drug2 time1_time2 constant
#> 2 b 2 12 drug1_drug2 time1_time2 constant
#> 3 c 3 13 drug1_drug2 time1_time2 constant
df_merged3 <- df_merged %>%
group_by(patient) %>%
mutate_at(vars(-group_cols()), .funs = ~paste(., collapse ="_")) %>%
filter(!duplicated(patient))
# I have many variables I can't specify manually
# I can create this merged data frame, but I don't want to
# concatenate duplicated values such as var1, var2, and var3
df_merged3
#> # A tibble: 3 x 6
#> # Groups: patient [3]
#> patient var1 var2 treatment time var3
#> <fct> <chr> <chr> <chr> <chr> <chr>
#> 1 a 1_1 11_11 drug1_drug2 time1_time2 constant_constant
#> 2 b 2_2 12_12 drug1_drug2 time1_time2 constant_constant
#> 3 c 3_3 13_13 drug1_drug2 time1_time2 constant_constant
Created on 2019-10-23 by the reprex package (v0.3.0)
I'd like to see if there is a way of concatenating variables containing only unique values to retain information from the second data frame without duplicating the rows in the df_merged
.
I would be happy to hear if you have recommendations other than dplyr
. A data.table
solution may also be suitable for me as well, since my real data frames are quite large.
Thanks!
Upvotes: 1
Views: 882
Reputation: 13125
We can use summarise_at
and unique
library(dplyr)
df_merged %>%
group_by(patient) %>%
summarise_at(vars(-group_cols()), .funs = ~paste(unique(.), collapse ="_"))
Or we can do the merge/joint directly instead of adding/altering the Global Env with an intermediate dataframe.
left_join(df1,
df2 %>% group_by(patient) %>%
summarise_at(vars(-group_cols()), .funs = ~paste(unique(.), collapse ="_")) %>%
ungroup()
)
Joining, by = "patient"
patient var1 var2 treatment time var3
1 a 1 11 drug1_drug2 time1_time2 constant
2 b 2 12 drug1_drug2 time1_time2 constant
3 c 3 13 drug1_drug2 time1_time2 constant
#Here a toy example to experiment with, uncomment browser to see how it works inside Reduce,
#also see ?Reduce for more info
paste_mod <- function(x) Reduce(function(u, v){
u <- ifelse(!grepl('_',u) & is.na(u),'.',u)
v <- ifelse(is.na(v),'.',v)
if(v=='.' | !grepl(v,u)) paste0(u,'_',v) else u
}, x)
paste_mod(c("drug1",NA,NA,"drug2","drug1","drug2"))
[1] "drug1_._._drug2"
paste_mod(c(NA,NA,"drug2","drug1","drug2"))
[1] "._._drug2_drug1"
#replace NA with . then apply Reduce
df2 %>%
mutate_if(is.factor,as.character) %>% mutate_all(~replace(.,is.na(.),'.')) %>%
group_by(patient) %>%
summarise_at(vars(-group_cols()), .funs = ~Reduce(function(u, v) if(v=='.' | !grepl(v,u)) paste0(u,'_',v) else u, .)) %>%
ungroup()
# A tibble: 2 x 4
patient treatment time var3
<chr> <chr> <chr> <chr>
1 a drug1_._._drug2 time1_time2 constant
2 c drug1_drug2 time1_time2 constant
New df2
for testing the updated solution
df2 <- structure(list(patient = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("a",
"c"), class = "factor"), treatment = structure(c(1L, NA, NA,
2L, 1L, 2L), .Label = c("drug1", "drug2"), class = "factor"),
time = structure(c(1L, 2L, 1L, 2L, 1L, 2L), .Label = c("time1",
"time2"), class = "factor"), var3 = structure(c(1L, 1L, 1L,
1L, 1L, 1L), class = "factor", .Label = "constant")), class = "data.frame", row.names = c(NA,
-6L))
Upvotes: 2