Reputation: 455
I made up a data frame to explain my question; my real dataset is much bigger.
gene <- c("a", "b", "c", "a", "b", "c", "a", "b", "c")
sample <- c("a", "a", "a", "b", "b", "b", "c", "c", "c")
expression <- c("5", "6", "8", "3", "5", "7", "7", "8", "9")
data.frame(gene, sample, expression)
  gene sample expression
1    a      a          5
2    b      a          6
3    c      a          8
4    a      b          3
5    b      b          5
6    c      b          7
7    a      c          7
8    b      c          8
9    c      c          9
and
gene2 <- c("a", "b", "c", "a", "b", "c", "a", "b", "c")
sample2 <- c("1", "1", "1", "2", "2", "2", "3", "3", "3")
expression2 <- c("5.4", "6.3", "8", "3.2", "5.4", "7.2", "7.1", "8.2", "9.4")
data.frame(gene2, sample2, expression2)
  gene2 sample2 expression2
1     a       1         5.4
2     b       1         6.3
3     c       1           8
4     a       2         3.2
5     b       2         5.4
6     c       2         7.2
7     a       3         7.1
8     b       3         8.2
9     c       3         9.4
So I have two data frames with different sample identifiers, but the expression data should be roughly the same. For each sample, I want to find the closest-matching expression values in the other data frame and report the corresponding sample identifiers. The result could look something like this:
  gene sample sample2 expression expression2
1    a      a       1          5         5.4
2    b      a       1          6         6.3
3    c      a       1          8           8
4    a      b       2          3         3.2
5    b      b       2          5         5.4
6    c      b       2          7         7.2
7    a      c       3          7         7.1
8    b      c       3          8         8.2
9    c      c       3          9         9.4
I think a rolling join might work, but I'm kind of lost on this.
Upvotes: 0
Views: 75
Reputation: 5232
You can use split (to compare genes), outer (to create a distance matrix) and apply (to find, for each row, the column with the minimum value). Using mapply you can wrap everything together:
data:
df1 <- data.frame(gene, sample, expression, stringsAsFactors = FALSE)
df2 <- data.frame(gene2, sample2, expression2, stringsAsFactors = FALSE)
df1$expression <- as.numeric(df1$expression)
df2$expression2 <- as.numeric(df2$expression2)
code:
do.call(
  rbind,
  mapply(
    function(x, y){
      # for each row of x, index of the row of y with the closest expression
      j <- apply(
        abs(outer(x$expression, y$expression2, FUN = "-")), 1, which.min
      )
      cbind(x, y[j,])
    },
    split(df1, df1$gene),   # one data frame per gene
    split(df2, df2$gene2),
    SIMPLIFY = FALSE
  )
)
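To see what the outer / apply / which.min step does on its own, here is a minimal sketch on two made-up vectors (x and y are hypothetical toy values, not the question's data):

```r
# Hypothetical toy vectors, only for illustration
x <- c(5, 3, 7)        # values to match
y <- c(7.1, 5.4, 3.2)  # candidate values

# d[i, j] holds |x[i] - y[j]|
d <- abs(outer(x, y, FUN = "-"))

# For each row of d, the column index of the smallest distance
j <- apply(d, 1, which.min)

# Pair every x with its nearest y
nearest <- data.frame(x = x, nearest_y = y[j])
nearest
```

Note that which.min returns the first minimum, so ties are resolved in favour of the earlier element of y.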
Upvotes: 0
Reputation: 52004
You can do a rolling join with data.table:
library(data.table)
setDT(df1)[, expression := as.numeric(expression)]
setDT(df2)[, ":="(sample = unique(df1$sample)[as.numeric(sample2)],
                  gene = gene2,
                  expression = as.numeric(expression2))]
df <- df2[df1, on = .(gene, sample, expression), roll = "nearest"][, gene2 := NULL][]
setcolorder(df, rev(seq_along(df)))
df
#    gene expression sample expression2 sample2
# 1:    a          5      a         5.4       1
# 2:    b          6      a         6.3       1
# 3:    c          8      a           8       1
# 4:    a          3      b         3.2       2
# 5:    b          5      b         5.4       2
# 6:    c          7      b         7.2       2
# 7:    a          7      c         7.1       3
# 8:    b          8      c         8.2       3
# 9:    c          9      c         9.4       3
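For anyone unfamiliar with roll = "nearest", here is a stripped-down sketch of the same idea on a made-up table. The names ref, qry, value and id are hypothetical and only serve the illustration. The rolling match applies to the last column listed in on, which has to be numeric; the earlier columns are matched exactly.

```r
library(data.table)

# Hypothetical reference table: one gene, three candidate values with an id each
ref <- data.table(gene  = c("a", "a", "a"),
                  value = c(3.2, 5.4, 7.1),
                  id    = c("2", "1", "3"))

# Hypothetical query table: values to look up
qry <- data.table(gene = c("a", "a"), value = c(5, 7))

# Exact match on gene, nearest match on value (the last join column)
res <- ref[qry, on = .(gene, value), roll = "nearest"]
res
```

In the result, the value column carries the values from qry, and id is taken from the ref row whose value is closest.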
Upvotes: 1