Reputation: 2690
I have a table that looks as follows:
Table1 <- data.frame(
"Random" = c("A", "B", "C"),
"Genes" = c("Apple", "Candy", "Toothpaste"),
"Extra" = c("Up", "", "Down"),
"Desc" = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist")
)
giving:
Random Genes Extra Desc
1 A Apple Up Healthy,Red,Fruit
2 B Candy Sweet,Cavities,Sugar,Fruity
3 C Toothpaste Down Minty,Dentist
I have another table with Descriptions and want to add a columns with Genes. For instance Table2 would be:
Table2 <- data.frame(
"Col1" = c(1, 2, 3, 4, 5, 6),
"Desc" = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity")
)
Giving:
Col1 Desc
1 1 Sweet
2 2 Sugar
3 3 Dentist
4 4 Red
5 5 Fruit
6 6 Fruity
I want to add another column to Table2 named "Genes" that matches the "Desc" from both tables and adds the Genes from Table1 to get:
Col1 Desc Gene
1 1 Sweet Candy
2 2 Sugar Candy
3 3 Dentist Toothpaste
4 4 Red Apple
5 5 Fruit Apple
6 6 Fruity Candy
Upvotes: 12
Views: 12937
Reputation: 10422
Assuming there aren't too many terms to match, here's an option using some tidyverse
functions:
library(tidyverse)
crossing(Table1, Table2) %>%
mutate_if(is.factor, as.character) %>%
rowwise() %>%
filter(str_detect(Desc, Desc1)) %>%
select(Col1, Desc = Desc1, Genes) %>%
arrange(Col1)
# A tibble: 7 x 3
Col1 Desc Genes
<dbl> <chr> <chr>
1 1 Sweet Candy
2 2 Sugar Candy
3 3 Dentist Toothpaste
4 4 Red Apple
5 5 Fruit Apple
6 5 Fruit Candy
7 6 Fruity Candy
Upvotes: 0
Reputation: 109844
If we can get a key lookup into a named list or 2 vectors (e.g., 2 column data frame) we can use the %l%
function int he *qdapTools** package I maintain. First I'll split up your Table1$desc
into a named list using the strsplit
function. That's out key. The we can do the lookup via Table2$Desc
. This uses the *data.table** package in the backend so it's pretty speedy:
library(qdapTools)
key <- setNames(strsplit(as.character(Table1[["Desc"]]), "\\s*,\\s*"), Table1[["Genes"]])
## $Apple
## [1] "Healthy" "Red" "Fruit"
##
## $Candy
## [1] "Sweet" "Cavities" "Sugar" "Fruity"
##
## $Toothpaste
## [1] "Minty" "Dentist"
Table2[["Gene"]] <- Table2[["Desc"]] %l% key
## Col1 Desc Gene
## 1 1 Sweet Candy
## 2 2 Sugar Candy
## 3 3 Dentist Toothpaste
## 4 4 Red Apple
## 5 5 Fruit Apple
## 6 6 Fruity Candy
Here's a pure base vector lookup that should also be pretty speedy:
x <- strsplit(as.character(Table1[["Desc"]]), "\\s*,\\s*")
key <- setNames(rep(Table1[["Genes"]], sapply(x, length)), unlist(x))
Table2[["Gene"]] <- key[match(Table2[["Desc"]], names(key))]
Upvotes: 4
Reputation: 66819
Following @TylerRinker's answer, I'd format the Table1$Desc
string first:
Table1a <- with(Table1,
stack(setNames(sapply(as.character(Desc),strsplit,split=","),Genes)))
names(Table1a) <- c("Desc","Genes")
Then go to data.table
:
require(data.table)
DT1 <- data.table(Table1a,key="Desc")
DT2 <- data.table(Table2,key="Desc")
Then merge-n-define:
DT2[DT1,Gene:=Genes]
# Col1 Desc Gene
# 1: 3 Dentist Toothpaste
# 2: 5 Fruit Apple
# 3: 6 Fruity Candy
# 4: 4 Red Apple
# 5: 2 Sugar Candy
# 6: 1 Sweet Candy
Upvotes: 3
Reputation: 10167
Here is a method in base R that uses an intermediate linking table:
# create an intermediate data.frame with all the key (Desc) / value (Gene) pairs
df <- NULL
for(i in seq(nrow(Table1)))
df <- rbind(df,
data.frame(Gene =Table1$Genes[i],
Desc =strsplit(as.character(Table1$Desc)[i],',')[[1]]))
df
#> Gene Desc
#> 1 Apple Healthy
#> 2 Apple Red
#> 3 Apple Fruit
#> 4 Candy Sweet
#> 5 Candy Cavities
#> 6 Candy Sugar
#> 7 Candy Fruity
#> 8 Toothpaste Minty
#> 9 Toothpaste Dentist
Now link to it in the usual way:
Table2$Gene <- df$Gene[match(Table2$Desc,df$Desc)]
Upvotes: 6
Reputation: 6749
Assuming that each string is unique (i.e. that Fruit can't appear for more than one Gene), you can do this fairly easily using a for
loop and grep
. It may be slow on a huge dataset however.
options(stringsAsFactors = FALSE)
Table1 <- data.frame("Random" = c("A", "B", "C"), "Genes" = c("Apple", "Candy", "Toothpaste"), "Extra" = c("Up", "", "Down"), "Desc" = c("Healthy,Red,Fruit", "Sweet,Cavities,Sugar,Fruity", "Minty,Dentist"))
Table2 <- data.frame("Col1" = c(1, 2, 3, 4, 5, 6), "Desc" = c("Sweet", "Sugar", "Dentist", "Red", "Fruit", "Fruity"))
Table2$Gene <- NA
for(x in 1:nrow(Table2)) {
Table2[x,"Gene"] <- Table1$Genes[grep(pattern = paste("\\b",Table2$Desc[x],"\\b",sep=""),x = Table1$Desc)]
}
Table2
Col1 Desc Gene
1 1 Sweet Candy
2 2 Sugar Candy
3 3 Dentist Toothpaste
4 4 Red Apple
5 5 Fruit Apple
6 6 Fruity Candy
Upvotes: 3
Reputation: 886998
You could try cSplit
from splitstackshape
to split the 'Desc' column in "Table1" and convert the dataset from 'wide' to 'long' format. The output will be a data.table
. We can use the data.table
methods for setting the key column as 'Desc' (setkey
), join with "Table2", and finally remove the columns that are not needed in the output either by selecting the columns or assigning (:=
) the unwanted columns to NULL
library(splitstackshape)
setkey(cSplit(Table1, 'Desc', ',', 'long'),Desc)[Table2[2:1]][
,c(5,4,2), with=FALSE]
# Col1 Desc Genes
#1: 1 Sweet Candy
#2: 2 Sugar Candy
#3: 3 Dentist Toothpaste
#4: 4 Red Apple
#5: 5 Fruit Apple
#6: 6 Fruity Candy
Upvotes: 8