Reputation: 467
I have two data frames, competitor_sheet
and left_join_prototype
, which I'll call df1 and df2, respectively, df1 and df2 each contain information on genes and disease IDs, and look like this:
df1:
HUGO_symbol | MeSH_ID |
---|---|
P53 | D000310 |
A1BG | D0002277 |
ZZZ3 | D000230 |
df2:
Gene.Name | Parent.MeSH.ID | Child.MeSH.ID |
---|---|---|
P53 | D000310 | D015675, D006676 |
HGA2 | D031031 | D002277 |
ZZZ3 | D001163, D000230 | D003451 |
As you can see, df2 may contain multiple IDs in both the Parent and Child MeSH ID's. Essentially, what I want to do is left join any rows from df2 to df1 if any of the MeSH ID's in either the child or parent correspond to MeSH ID's in the df1, but only if the Gene.Name = HUGO_symbol, the final product should look like this:
HUGO_symbol | MeSH_ID | Gene.Name | Parent.Mesh.ID | Child.MeSH.ID |
---|---|---|---|---|
P53 | D000310 | P53 | D000310 | D015675, D006676 |
ZZZ3 | D000230 | ZZZ3 | D001163, D00230 | D003451 |
I've tried something using dplyr that looks something like this:
proto <- left_join(df1,df2, by = c("MeSH_ID"="Parent.MeSH.ID", "HUGO_Symbol"="Gene.Name", "MeSH_ID2"="Child.MeSH.ID"))
where proto
would be the new table but I have a feeling this is wrong.
As always, any help is hugely appreciated :)
Upvotes: 0
Views: 550
Reputation: 125897
The issue is that a left_join looks for exact matches and there is nothing like "match this or that". Hence, to achieve your desired result you could
Parent.MeSH.ID
and Child.MeSH.ID
into a new column MeSH_ID
tidyr::separate_rows
. Doing so makes it possible to join the df's by ID.semi_join
to filter out rows in df1 with matches in the newly created df3, finally do a left_join
to add the columns from df3. Or if doesn't matter to keep both HUGO_symbol
and Gene.Name
you could achieve both steps with an inner_join
.df1 <- data.frame(
stringsAsFactors = FALSE,
HUGO_symbol = c("P53", "A1BG", "ZZZ3"),
MeSH_ID = c("D000310", "D0002277", "D000230")
)
df2 <- data.frame(
stringsAsFactors = FALSE,
Gene.Name = c("P53", "HGA2", "ZZZ3"),
Parent.MeSH.ID = c("D000310", "D031031", "D001163, D000230"),
Child.MeSH.ID = c("D015675, D006676", "D002277", "D003451")
)
library(dplyr)
library(tidyr)
df3 <- df2 %>%
unite("MeSH_ID", Parent.MeSH.ID, Child.MeSH.ID, sep = ", ", remove = FALSE) %>%
separate_rows(MeSH_ID, sep = ", ")
semi_join(df1, df3, by = c("HUGO_symbol" = "Gene.Name", "MeSH_ID")) %>%
left_join(df3)
#> Joining, by = "MeSH_ID"
#> HUGO_symbol MeSH_ID Gene.Name Parent.MeSH.ID Child.MeSH.ID
#> 1 P53 D000310 P53 D000310 D015675, D006676
#> 2 ZZZ3 D000230 ZZZ3 D001163, D000230 D003451
Upvotes: 2