KoenVdB

Reputation: 423

Matrix of common character features using vectors in R

I am using the R language.

I have about 9 character vectors, each containing between 4 and 3900 gene names. I want to find out how many genes these vectors have in common. For example:

geneList1=c("gene1","gene2","gene6","gene28")
geneList2=c("gene2","gene4","gene1")

Using the %in% operator, I can check which elements of geneList2 also appear in geneList1:

geneList2 %in% geneList1
## [1]  TRUE FALSE  TRUE

Since I have relatively large vectors, ideally I would like to look at the proportion of genes in common, i.e.

mean(geneList2 %in% geneList1)
## [1] 0.6666667

So this is easy for two vectors, but how about 9 vectors? There must be a better way than manually comparing each vector with every other vector. Ideally, I would like a 'features in common' matrix, with 1's on the diagonal (each vector has all features in common with itself) and, off the diagonal, the proportion of features that the row vector shares with the column vector. Something like:

          geneList1 geneList2
geneList1 1.0000000       0.5
geneList2 0.6666667       1.0

But for multiple vectors.

Upvotes: 2

Views: 127

Answers (2)

Kasper Van Lombeek

Reputation: 633

There are possibly many faster methods using the apply functions, but an old-school nested for loop does the trick as well.

First put all the separate genList vectors into one list, then loop over that list twice.

genList1 <- as.character(sample(x = 1:10, size=1))
genList2 <- as.character(sample(x = 1:10, size=2))
genList3 <- as.character(sample(x = 1:10, size=3))
genList4 <- as.character(sample(x = 1:10, size=4))
genList5 <- as.character(sample(x = 1:10, size=10))
genList6 <- as.character(sample(x = 1:10, size=6))
genList7 <- as.character(sample(x = 1:10, size=7))
genList8 <- as.character(sample(x = 1:10, size=8))
genList9 <- as.character(sample(x = 1:10, size=1))

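# Collect the vectors in one list so they can be indexed by position
genlist <- list(genList1, genList2, genList3, genList4, genList5, genList6, genList7,
                genList8, genList9)

N <- length(genlist)

# common_matrix[i, j] is the proportion of elements of vector i found in vector j
common_matrix <- matrix(0, ncol = N, nrow = N)
for (i in seq_len(N)) {
    for (j in seq_len(N)) {
        common_matrix[i, j] <- mean(genlist[[i]] %in% genlist[[j]])
    }
}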
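
As a rough sketch of the loop-free alternative hinted at above, the same matrix can be built with outer(). This assumes the vectors sit in a named list; the names gene_sets and overlap are made up for illustration, and the data are the two example vectors from the question.

```r
# Example data from the question; `gene_sets` is a hypothetical name.
gene_sets <- list(
  geneList1 = c("gene1", "gene2", "gene6", "gene28"),
  geneList2 = c("gene2", "gene4", "gene1")
)

# Proportion of the elements of set i that also occur in set j.
overlap <- function(i, j) mean(gene_sets[[i]] %in% gene_sets[[j]])

# outer() passes whole index vectors to its function, so the scalar
# helper is wrapped in Vectorize() to be applied pair by pair.
idx <- seq_along(gene_sets)
common_matrix <- outer(idx, idx, Vectorize(overlap))

# Label rows and columns with the list names.
dimnames(common_matrix) <- list(names(gene_sets), names(gene_sets))
common_matrix
```

This is not necessarily faster than the loop, since Vectorize() still calls the helper once per pair; it mainly saves the bookkeeping of preallocating and filling the matrix.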

Upvotes: 1

David Arenburg

Reputation: 92310

The biggest problem (as I see it) is that your vectors are not of the same length, so you can't keep them in any form other than a list. The first step is therefore to get them from your global environment into a list object using a combination of mget and ls, and then to compare each vector against all the others:

l <- mget(ls(pattern = "geneList\\d+"))
sapply(l, function(x) lapply(l, function(y) mean(y %in% x)))
#           geneList1 geneList2
# geneList1 1         0.5      
# geneList2 0.6666667 1   
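
One thing to be aware of: because the inner call is lapply, the result is a matrix whose cells are list elements rather than plain numbers. If a numeric matrix is needed (for example to pass to a heatmap function), a minimal variation is to use vapply for the inner call. The sketch below builds the list explicitly instead of using mget, only so that it is self-contained.

```r
geneList1 <- c("gene1", "gene2", "gene6", "gene28")
geneList2 <- c("gene2", "gene4", "gene1")
l <- list(geneList1 = geneList1, geneList2 = geneList2)

# The outer sapply() builds the columns (x) and the inner vapply() the rows (y),
# so entry [y, x] is the proportion of y's genes that are found in x.
m <- sapply(l, function(x) vapply(l, function(y) mean(y %in% x), numeric(1)))

is.numeric(m)  # TRUE: a plain numeric matrix, not a matrix of list cells
m
```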

Upvotes: 1
