DiamondJoe12
DiamondJoe12

Reputation: 1817

Merging tables in R using multiple fields in left table, one field in right table

I have two dataframes that I'm trying to merge. The first dataframe (hospital_df) contains a case ID and then a diagnosis codes for diagnosis 1, 2, 3, 4, etc etc. The second dataframe is a description of each unique diagnosis code, matching the code to a description:

id <- c(1, 1, 5, 5, 5, 40, 40, 99, 99, 99, 99)
principal_diagnosis <- c('AA', 'CS', 'AA', 'EG', 'SD', 'DF', 'DD', 'AA', 'DF', 'HG', 'SD')
other_diagnosis <- c('AB', 'CF', 'AD', 'AA', 'SS', 'DS', 'DF', 'AG', 'AB', 'FG', 'VV')
other_diagnosis2 <- c('XX', 'DD', 'SQ', 'SD', 'DF', 'CV', 'DA', 'FD', 'SS', 'WV', 'AA')
hospital_df <- data.frame(id,principal_diagnosis, other_diagnosis, other_diagnosis2)

principal_diagnosis_codes <- c('AA', 'CS', 'AA', 'EG', 'SD', 'DF', 'DD', 'AA', 'DF', 'HG', 'SD', 'AB', 'CF', 'AD', 'SS', 'DS', 'AG', 'FG', 'VV', 'XX', 'SQ', 'CV', 'DA', 'FD', 'WV')
length(principal_diagnosis_codes)
description <- c('disease1',
                 'disease2',
                 'disease3',
                 'disease4',
                 'disease5',
                 'disease6',
                 'disease7',
                 'disease8',
                 'disease9',
                 'disease10',
                 'disease11',
                 'disease12',
                 'disease13',
                 'disease14',
                 'disease15',
                 'disease16',
                 'disease17',
                 'disease18',
                 'disease19',
                 'disease20',
                 'disease21',
                 'disease22',
                 'disease23',
                 'disease24',
                 'disease25')

diagnosis_codes_df <- data.frame(principal_diagnosis_codes, description)

I would like the output to look like:

ID  principal_diagnosis     other_diagnosis    other_diagnosis2
1   disease1                disease12          disease20
1   disease2                disease13          disease7
5   disease1                disease14          disease21

I've tried:

    mergedData <- merge(hospital_df, diagnosis_codes_df, by.x=c("principal_diagnosis"),
                        by.y=c("principal_diagnosis_codes"))
    
   mergedData <- merge(mergedData, diagnosis_codes_df, by.x=c("other_diagnosis"),
                        by.y=c("principal_diagnosis_codes"))

mergedData <- merge(mergedData, diagnosis_codes_df, by.x=c("other_diagnosis2"),
                        by.y=c("principal_diagnosis_codes"))

But this seems inefficient and also gives me description, description.y, description.x, etc which is slightly confusing. Is there a better way? I'm a newcomer to R.

Upvotes: 0

Views: 59

Answers (3)

Ronak Shah
Ronak Shah

Reputation: 389235

We can use match :

library(dplyr)

hospital_df %>% mutate(across(-1,~diagnosis_codes_df$description[
                match(.x, diagnosis_codes_df$principal_diagnosis_codes)])) 



#   id principal_diagnosis other_diagnosis other_diagnosis2
#1   1            disease1       disease12        disease20
#2   1            disease2       disease13         disease7
#3   5            disease1       disease14        disease21
#4   5            disease4        disease1         diseases
#5   5            disease5       disease15         disease6
#6  40            disease6       disease16        disease22
#7  40            disease7        disease6        disease23
#8  99            disease1       disease17        disease24
#9  99            disease6       disease12        disease15
#10 99           disease10       disease18        disease25
#11 99            disease5       disease19         disease1

Same using base R :

hospital_df[-1] <- lapply(hospital_df[-1], function(x) 
            diagnosis_codes_df$description[match(x, 
            diagnosis_codes_df$principal_diagnosis_codes)]) 

Upvotes: 1

akrun
akrun

Reputation: 887851

We could deframe to create a named vector and use that to replace multiple columns in the 'hospital_df' with mutate/across. Here, we use R 4.0 where data.frame stringsAsFactors = FALSE by default

library(dplyr) # 1.0.0
library(tibble)
nm1 <- deframe(diagnosis_codes_df)
hospital_df %>% 
    mutate(across(contains('diagnosis'), ~ nm1[.x]))
#    id principal_diagnosis other_diagnosis other_diagnosis2
#1   1            disease1       disease12        disease20
#2   1            disease2       disease13         disease7
#3   5            disease1       disease14        disease21
#4   5            disease4        disease1         disease5
#5   5            disease5       disease15         disease6
#6  40            disease6       disease16        disease22
#7  40            disease7        disease6        disease23
#8  99            disease1       disease17        disease24
#9  99            disease6       disease12        disease15
#10 99           disease10       disease18        disease25
#11 99            disease5       disease19         disease1

Or using base R, construct a named vector ('nm1'), loop over the 'diagnosis' columns of 'hospital_df' (lapply), use the named vector to match and replace the values, and assign it back to the columns of the dataset

nm1 <- with(diagnosis_codes_df, setNames(description, principal_diagnosis_codes))
hospital_df[-1] <- lapply(hospital_df[-1], function(x) nm1[x])

Or convert to matrix and then do a match

hospital_df[-1] <-  nm1[as.matrix(hospital_df[-1])]

Upvotes: 1

Daniel O
Daniel O

Reputation: 4358

this code in Base-R uses the first diagnosis code in the list that matches.

hospital_df[-1] <- sapply(unlist(hospital_df[-1]), function(x) diagnosis_codes_df$description[x == diagnosis_codes_df][1])


   id principal_diagnosis other_diagnosis other_diagnosis2
1   1            disease1       disease12        disease20
2   1            disease2       disease13         disease7
3   5            disease1       disease14        disease21
4   5            disease4        disease1         disease5
5   5            disease5       disease15         disease6
6  40            disease6       disease16        disease22
7  40            disease7        disease6        disease23
8  99            disease1       disease17        disease24
9  99            disease6       disease12        disease15
10 99           disease10       disease18        disease25
11 99            disease5       disease19         disease1

Upvotes: 1

Related Questions