szmple

Reputation: 455

Matching samples in R

I made up a small data frame to explain my question; my real dataset is much bigger.

gene <- c("a", "b", "c", "a", "b", "c", "a", "b", "c")
sample <- c("a", "a", "a", "b", "b", "b", "c", "c", "c")
expression <- c("5", "6", "8", "3", "5", "7", "7", "8", "9")
data.frame(gene, sample, expression)

  gene sample expression
1    a      a          5
2    b      a          6
3    c      a          8
4    a      b          3
5    b      b          5
6    c      b          7
7    a      c          7
8    b      c          8
9    c      c          9

and

gene2 <- c("a", "b", "c", "a", "b", "c", "a", "b", "c")
sample2 <- c("1", "1", "1", "2", "2", "2", "3", "3", "3")
expression2 <- c("5.4", "6.3", "8", "3.2", "5.4", "7.2", "7.1", "8.2", "9.4")
data.frame(gene2, sample2, expression2)

  gene2 sample2 expression2
1     a       1         5.4
2     b       1         6.3
3     c       1           8
4     a       2         3.2
5     b       2         5.4
6     c       2         7.2
7     a       3         7.1
8     b       3         8.2
9     c       3         9.4

So I have two data frames with different sample identifiers, but the expression data should be roughly the same. For each sample, I want to find the closest matching expression values in the other data frame and report the corresponding sample identifiers. The result could look something like this:

  gene sample sample2 expression expression2
1    a      a       1          5         5.4
2    b      a       1          6         6.3
3    c      a       1          8           8
4    a      b       2          3         3.2
5    b      b       2          5         5.4
6    c      b       2          7         7.2
7    a      c       3          7         7.1
8    b      c       3          8         8.2
9    c      c       3          9         9.4

I think a rolling join might work, but I'm kind of lost on how to set it up.

Upvotes: 0

Views: 75

Answers (2)

det

Reputation: 5232

You can use split (to compare each gene separately), outer (to create a distance matrix), and apply (to find, for each row, the column with the minimum value). mapply wraps everything together:

data:

df1 <- data.frame(gene, sample, expression, stringsAsFactors = FALSE)
df2 <- data.frame(gene2, sample2, expression2, stringsAsFactors = FALSE)

df1$expression <- as.numeric(df1$expression)
df2$expression2 <- as.numeric(df2$expression2)

code:

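do.call(
  rbind,
  mapply(
    function(x, y){
      # absolute differences: rows are df1 values, columns are df2 values
      j <- apply(
        abs(outer(x$expression, y$expression2, FUN = "-")), 1, which.min
      )
      # attach the closest df2 row to each df1 row
      cbind(x, y[j,])
    },
    # one piece per gene, so values are only compared within the same gene
    split(df1, df1$gene),
    split(df2, df2$gene2),
    SIMPLIFY = FALSE
  )
)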
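To make the outer / which.min step concrete, here is a minimal sketch for a single gene. It uses the values for gene "a" from the question; the names x, y, d, j and matches are only illustrative.

```r
# Expression of gene "a" in each data frame (values from the question).
# Vector names are the sample identifiers.
x <- c(a = 5, b = 3, c = 7)               # from df1
y <- c("1" = 5.4, "2" = 3.2, "3" = 7.1)   # from df2

# Distance matrix: rows follow x, columns follow y, cell = |x[i] - y[j]|
d <- abs(outer(x, y, FUN = "-"))

# For every row, the index of the column with the smallest distance
j <- apply(d, 1, which.min)

# Translate column indices back into df2 sample identifiers
matches <- data.frame(
  sample = names(x),
  sample2 = names(y)[j],
  stringsAsFactors = FALSE
)
print(matches)
```

Note that the match is made row by row, so in real data two genes from the same sample could end up paired with different samples in the other data frame.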

Upvotes: 0

Maël

Reputation: 52004

You can do a rolling join with data.table:

library(data.table)
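# convert to data.table by reference and make expression numeric
setDT(df1)[, expression := as.numeric(expression)]
# give df2 the same join columns as df1; the sample mapping assumes that
# sample2 = 1, 2, 3 corresponds, in order, to unique(df1$sample)
setDT(df2)[, `:=`(sample = unique(df1$sample)[as.numeric(sample2)],
                  gene = gene2,
                  expression = as.numeric(expression2))]

# exact match on gene and sample, nearest match on expression (the last on column)
df <- df2[df1, on = .(gene, sample, expression), roll = "nearest"][, gene2 := NULL][]
setcolorder(df, rev(seq_along(df)))
df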

#    gene expression sample expression2 sample2
# 1:    a          5      a         5.4       1
# 2:    b          6      a         6.3       1
# 3:    c          8      a           8       1
# 4:    a          3      b         3.2       2
# 5:    b          5      b         5.4       2
# 6:    c          7      b         7.2       2
# 7:    a          7      c         7.1       3
# 8:    b          8      c         8.2       3
# 9:    c          9      c         9.4       3
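If the correspondence between sample and sample2 is not known in advance, one possible alternative is to compare whole samples across all genes and keep the closest one. The following is a minimal base R sketch of that idea, not part of the join above. It assumes exactly one value per gene/sample pair and uses the sum of absolute differences as the distance; the names m1, m2, d, lookup and result are illustrative.

```r
# Example data from the question, with expression stored as numbers
df1 <- data.frame(
  gene = rep(c("a", "b", "c"), times = 3),
  sample = rep(c("a", "b", "c"), each = 3),
  expression = c(5, 6, 8, 3, 5, 7, 7, 8, 9),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  gene2 = rep(c("a", "b", "c"), times = 3),
  sample2 = rep(c("1", "2", "3"), each = 3),
  expression2 = c(5.4, 6.3, 8, 3.2, 5.4, 7.2, 7.1, 8.2, 9.4),
  stringsAsFactors = FALSE
)

# Gene-by-sample matrices (assumes one value per gene/sample pair)
m1 <- tapply(df1$expression, list(df1$gene, df1$sample), mean)
m2 <- tapply(df2$expression2, list(df2$gene2, df2$sample2), mean)
m2 <- m2[rownames(m1), , drop = FALSE]  # same gene order in both

# d[s, s2] = sum over genes of |expression in s - expression in s2|
d <- sapply(colnames(m2), function(s2) colSums(abs(m1 - m2[, s2])))

# Closest sample2 for each sample
lookup <- data.frame(
  sample = rownames(d),
  sample2 = colnames(d)[apply(d, 1, which.min)],
  stringsAsFactors = FALSE
)

# Attach the matched identifiers and expression values to df1
result <- merge(
  merge(df1, lookup, by = "sample"),
  df2,
  by.x = c("gene", "sample2"),
  by.y = c("gene2", "sample2")
)
print(lookup)
```

Because every gene contributes to the distance, all rows of one sample are matched to the same sample2, which is what the desired output in the question shows.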

Upvotes: 1
