Reputation: 215
I have a data frame (data1) with one column of gene IDs. Another data frame (data2) holds the corresponding gene names. Some cells in data1 contain multiple gene IDs separated by ':', and there are also a lot of NAs. Preferably, I want to add a column to data1 with the corresponding gene names, also separated by ':' where there are multiple. An alternative would be to replace all the gene IDs in data1 with the corresponding gene names. Any idea how to go about this? Thanks!
a <- c("ENSG00000150401:ENSG00000150403", "ENSG00000185294", "NA")
data1 <- data.frame(a)
b <- c("ENSG00000150401", "ENSG00000150403", "ENSG00000185294")
c <- c("GeneA", "GeneB", "GeneC")
data2 <- data.frame(b,c)
Upvotes: 2
Views: 254
Reputation: 887501
Here is another option with gsubfn
library(gsubfn)
data1$res <- gsubfn("\\w+", setNames(as.list(as.character(data2$c)),
data2$b), as.character(data1$a))
data1
# a res
#1 ENSG00000150401:ENSG00000150403 GeneA:GeneB
#2 ENSG00000185294 GeneC
#3 NA NA
In base R, this can also be done by splitting the 'a' column with strsplit and then matching against a named vector created from the 'b' and 'c' columns of the second dataset:
is.na(data1$a) <- data1$a == "NA" # converting to real NA instead of character
i1 <- !is.na(data1$a)
# create named vector
v1 <- setNames(as.character(data2$c), data2$b)
data1$res[i1] <- sapply(strsplit(as.character(data1$a[i1]), ":"),
function(x) paste(v1[x], collapse=":"))
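The split-and-look-up steps above can also be wrapped in a small function. This is only a sketch in base R: the helper name `map_ids` is hypothetical, the data frames are rebuilt with `stringsAsFactors = FALSE` so the columns are plain character vectors, and the literal string "NA" from the question is assumed to mean a missing value.

```r
# Example data from the question, built with character (not factor) columns
data1 <- data.frame(
  a = c("ENSG00000150401:ENSG00000150403", "ENSG00000185294", "NA"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  b = c("ENSG00000150401", "ENSG00000150403", "ENSG00000185294"),
  c = c("GeneA", "GeneB", "GeneC"),
  stringsAsFactors = FALSE
)

# Assumption: the string "NA" stands for a missing value
data1$a[data1$a == "NA"] <- NA

# Hypothetical helper: translate sep-delimited IDs through a named lookup vector
map_ids <- function(ids, lookup, sep = ":") {
  out <- rep(NA_character_, length(ids))   # missing input stays missing
  keep <- !is.na(ids)
  out[keep] <- vapply(
    strsplit(ids[keep], sep, fixed = TRUE),
    function(x) paste(lookup[x], collapse = sep),
    character(1)
  )
  out
}

lookup <- setNames(data2$c, data2$b)       # names = gene IDs, values = gene names
data1$res <- map_ids(data1$a, lookup)
data1
```

Note that an ID with no entry in the lookup vector is pasted as the string "NA" inside the result, so it may be worth checking for unmatched IDs before relying on the output.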
Upvotes: 0
Reputation: 389115
We can get data1 in long format, left_join it with data2, and paste the values together.
library(dplyr)
data1 %>%
mutate(row = row_number()) %>%
tidyr::separate_rows(a, sep = ":") %>%
left_join(data2, by = c('a' = 'b')) %>%
group_by(row) %>%
summarise(a = paste0(a, collapse = ":"),
c = paste0(c, collapse = ":")) %>%
select(-row)
# a c
# <chr> <chr>
#1 ENSG00000150401:ENSG00000150403 GeneA:GeneB
#2 ENSG00000185294 GeneC
#3 NA NA
Upvotes: 0
Reputation: 40051
One option involving stringr could be:
library(stringr)
data1$res <- str_replace_all(data1$a, setNames(data2$c, data2$b))
a res
1 ENSG00000150401:ENSG00000150403 GeneA:GeneB
2 ENSG00000185294 GeneC
3 NA NA
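With a replacement-based approach like this, an ID that has no entry in data2 is left unchanged in the result rather than flagged. A quick base R check, sketched here with one made-up ID ("ENSG00000999999") added to the example data purely for illustration, lists such IDs up front:

```r
# Example data; "ENSG00000999999" is a hypothetical ID with no match in data2
data1 <- data.frame(
  a = c("ENSG00000150401:ENSG00000999999", "ENSG00000185294", "NA"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  b = c("ENSG00000150401", "ENSG00000150403", "ENSG00000185294"),
  c = c("GeneA", "GeneB", "GeneC"),
  stringsAsFactors = FALSE
)

# Drop real NAs and the literal string "NA", then split into single IDs
present <- data1$a[!is.na(data1$a) & data1$a != "NA"]
ids <- unique(unlist(strsplit(present, ":", fixed = TRUE)))

# IDs that appear in data1 but have no gene name in data2
missing_ids <- setdiff(ids, data2$b)
missing_ids
# [1] "ENSG00000999999"
```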
Upvotes: 3