Reputation: 1817
I have two dataframes that I'm trying to merge. The first dataframe (hospital_df) contains a case ID and then a diagnosis codes for diagnosis 1, 2, 3, 4, etc etc. The second dataframe is a description of each unique diagnosis code, matching the code to a description:
id <- c(1, 1, 5, 5, 5, 40, 40, 99, 99, 99, 99)
principal_diagnosis <- c('AA', 'CS', 'AA', 'EG', 'SD', 'DF', 'DD', 'AA', 'DF', 'HG', 'SD')
other_diagnosis <- c('AB', 'CF', 'AD', 'AA', 'SS', 'DS', 'DF', 'AG', 'AB', 'FG', 'VV')
other_diagnosis2 <- c('XX', 'DD', 'SQ', 'SD', 'DF', 'CV', 'DA', 'FD', 'SS', 'WV', 'AA')
hospital_df <- data.frame(id,principal_diagnosis, other_diagnosis, other_diagnosis2)
principal_diagnosis_codes <- c('AA', 'CS', 'AA', 'EG', 'SD', 'DF', 'DD', 'AA', 'DF', 'HG', 'SD', 'AB', 'CF', 'AD', 'SS', 'DS', 'AG', 'FG', 'VV', 'XX', 'SQ', 'CV', 'DA', 'FD', 'WV')
length(principal_diagnosis_codes)
description <- c('disease1',
'disease2',
'disease3',
'disease4',
'disease5',
'disease6',
'disease7',
'disease8',
'disease9',
'disease10',
'disease11',
'disease12',
'disease13',
'disease14',
'disease15',
'disease16',
'disease17',
'disease18',
'disease19',
'disease20',
'disease21',
'disease22',
'disease23',
'disease24',
'disease25')
diagnosis_codes_df <- data.frame(principal_diagnosis_codes, description)
I would like the output to look like:
ID principal_diagnosis other_diagnosis other_diagnosis2
1 disease1 disease12 disease20
1 disease2 disease13 disease7
5 disease1 disease14 disease21
I've tried:
mergedData <- merge(hospital_df, diagnosis_codes_df, by.x=c("principal_diagnosis"),
by.y=c("principal_diagnosis_codes"))
mergedData <- merge(mergedData, diagnosis_codes_df, by.x=c("other_diagnosis"),
by.y=c("principal_diagnosis_codes"))
mergedData <- merge(mergedData, diagnosis_codes_df, by.x=c("other_diagnosis2"),
by.y=c("principal_diagnosis_codes"))
But this seems inefficient and also gives me description, description.y, description.x, etc which is slightly confusing. Is there a better way? I'm a newcomer to R.
Upvotes: 0
Views: 59
Reputation: 389235
We can use match
:
library(dplyr)
hospital_df %>% mutate(across(-1,~diagnosis_codes_df$description[
match(.x, diagnosis_codes_df$principal_diagnosis_codes)]))
# id principal_diagnosis other_diagnosis other_diagnosis2
#1 1 disease1 disease12 disease20
#2 1 disease2 disease13 disease7
#3 5 disease1 disease14 disease21
#4 5 disease4 disease1 diseases
#5 5 disease5 disease15 disease6
#6 40 disease6 disease16 disease22
#7 40 disease7 disease6 disease23
#8 99 disease1 disease17 disease24
#9 99 disease6 disease12 disease15
#10 99 disease10 disease18 disease25
#11 99 disease5 disease19 disease1
Same using base R :
hospital_df[-1] <- lapply(hospital_df[-1], function(x)
diagnosis_codes_df$description[match(x,
diagnosis_codes_df$principal_diagnosis_codes)])
Upvotes: 1
Reputation: 887851
We could deframe
to create a named vector and use that to replace multiple columns in the 'hospital_df' with mutate/across
. Here, we use R 4.0
where data.frame
stringsAsFactors = FALSE
by default
library(dplyr) # 1.0.0
library(tibble)
nm1 <- deframe(diagnosis_codes_df)
hospital_df %>%
mutate(across(contains('diagnosis'), ~ nm1[.x]))
# id principal_diagnosis other_diagnosis other_diagnosis2
#1 1 disease1 disease12 disease20
#2 1 disease2 disease13 disease7
#3 5 disease1 disease14 disease21
#4 5 disease4 disease1 disease5
#5 5 disease5 disease15 disease6
#6 40 disease6 disease16 disease22
#7 40 disease7 disease6 disease23
#8 99 disease1 disease17 disease24
#9 99 disease6 disease12 disease15
#10 99 disease10 disease18 disease25
#11 99 disease5 disease19 disease1
Or using base R
, construct a named vector ('nm1'), loop over the 'diagnosis' columns of 'hospital_df' (lapply
), use the named vector to match and replace the values, and assign it back to the columns of the dataset
nm1 <- with(diagnosis_codes_df, setNames(description, principal_diagnosis_codes))
hospital_df[-1] <- lapply(hospital_df[-1], function(x) nm1[x])
Or convert to matrix
and then do a match
hospital_df[-1] <- nm1[as.matrix(hospital_df[-1])]
Upvotes: 1
Reputation: 4358
this code in Base-R uses the first diagnosis code in the list that matches.
hospital_df[-1] <- sapply(unlist(hospital_df[-1]), function(x) diagnosis_codes_df$description[x == diagnosis_codes_df][1])
id principal_diagnosis other_diagnosis other_diagnosis2
1 1 disease1 disease12 disease20
2 1 disease2 disease13 disease7
3 5 disease1 disease14 disease21
4 5 disease4 disease1 disease5
5 5 disease5 disease15 disease6
6 40 disease6 disease16 disease22
7 40 disease7 disease6 disease23
8 99 disease1 disease17 disease24
9 99 disease6 disease12 disease15
10 99 disease10 disease18 disease25
11 99 disease5 disease19 disease1
Upvotes: 1