Reputation: 467
I have two rows with ~4000 parent and child ID's for different cancers, I've created two columns next to them titled "Child_Name" and "Parent_Name". df1 is like so:
Parent ID | Child ID | Parent Name | Child Name |
---|---|---|---|
D015179 | D003110 | ||
D018307 | D002294 |
In another data frame, I have parent and child ID's in one row, and the names of them in the adjacent column:
Cancer_ID | Cancer |
---|---|
D015179 | Colorectal Neoplasms |
D018307 | Neoplasms, Squamous Cell |
D002294 | Carcinoma, Squamous Cell |
D003110 | Colonic Neoplasms |
Essentially, I want to fill in Parent Name and Child Name in df1 by finding the corresponding cancer_ID in the df2, and place the cancer name in either the child or parent name respectively, so that it looks like this :
Parent ID | Child ID | Parent Name | Child Name |
---|---|---|---|
D015179 | D003110 | Colorectal Neoplasms | Colonic Neoplasms |
D018307 | D002294 | Neoplasms, Squamous Cell | Carcinoma, Squamous Cell |
I believe there is likely a dplyr solution for this, but I haven't been able to come up with anything solid, as always, any help would be greatly appreciated!
These are dput()'s of the first 20 rows of df1 and df2 respectively that can be worked with directly (I hope I have presented these in the correct format):
df1:
structure(list(Parent_ID = c("D015179", "D015179", "D001932",
"D002528", "D018307", "D018307", "D003110", "D012004", "D015179",
"D015179", "D009442", "D009455", "D018358", "D018358", "D018295",
"D018295", "D001984", "D001984", "D010235", "D010235"), Child_ID = c("D003110",
"D003110", "D002528", "D001932", "D002294", "D002294", "D012004",
"D003110", "D012004", "D012004", "D009455", "D009442", "D018278",
"D018278", "D002280", "D002280", "D002283", "D002283", "D010236",
"D010236"), Child_Name = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Parent_Name = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA)), row.names = c(1L, 2L, 3L, 4L, 7L, 8L, 11L, 12L,
15L, 16L, 27L, 28L, 31L, 32L, 37L, 38L, 39L, 40L, 53L, 54L), class = "data.frame")
df2:
structure(list(Cancer_ID = c("D009369", "D003560", "D001845",
"D017824", "D007570", "D009631", "D009807", "D003803", "D018333",
"D010509", "D011842", "D001935", "D047688", "D001994", "D017043",
"D015529", "D004814", "D004934", "D005497", "D045888"), Cancer = c("Neoplasms",
"Cysts", "Bone Cysts", "Bone Cysts, Aneurysmal", "Jaw Cysts",
"Nonodontogenic Cysts", "Odontogenic Cysts", "Dentigerous Cyst",
"Odontogenic Cyst, Calcifying", "Periodontal Cyst", "Radicular Cyst",
"Branchioma", "Breast Cyst", "Bronchogenic Cyst", "Chalazion",
"Choledochal Cyst", "Epidermal Cyst", "Esophageal Cyst", "Follicular Cyst",
"Ganglion Cysts")), row.names = c(NA, 20L), class = "data.frame")
Upvotes: 1
Views: 151
Reputation: 26218
A slight variation of my friend's answer, using match()
and gsub()
in .names =
argument of across
df <- read.table(header = T, text = "Parent_ID Child_ID
D015179 D003110
D018307 D002294")
lookup <- read.table(header = T, text = "Cancer_ID Cancer
D015179 'Colorectal Neoplasms'
D018307 'Neoplasms, Squamous Cell'
D002294 'Carcinoma, Squamous Cell'
D003110 'Colonic Neoplasms'")
library(dplyr, warn.conflicts = F)
df %>% mutate(across(everything(), ~ lookup$Cancer[match(., lookup$Cancer_ID)],
.names = '{gsub("_ID", "_name", .col)}'))
#> Parent_ID Child_ID Parent_name Child_name
#> 1 D015179 D003110 Colorectal Neoplasms Colonic Neoplasms
#> 2 D018307 D002294 Neoplasms, Squamous Cell Carcinoma, Squamous Cell
Created on 2021-06-21 by the reprex package (v2.0.0)
Upvotes: 2