Microdot

Reputation: 37

How can I use a matrix of pairwise comparisons to find the best n candidates for my analysis?

I have a set of DNA sequences (strings) which I have compared against each other in a pairwise manner. Each comparison gave the exact number of similarities (how many nucleotides are the same) between the two sequences and was used to fill a lower triangular matrix. Now I want to find the subsets of 8 sequences (all possible groups of 8 sequences) in this matrix that have the smallest number of similarities among themselves (the pairwise similarities within these groups of 8 should be as low as possible), but I do not know how to proceed...

Any help using R (preferred) or Python would be greatly appreciated!

Below is an example of my matrix. The main idea here is to find the subsets of n sequences (for example, 2 sequences) that share the smallest number of similarities among themselves. My original matrix is 61×61.

         seq1 seq2 seq3 seq4
    seq1   NA   NA   NA   NA
    seq2    1   NA   NA   NA
    seq3    2    5   NA   NA
    seq4    3    2    6   NA

In this example, the subset of n=2 with the smallest similarities is (seq1, seq2), with similarity = 1. The subset of n=3 would be (seq1, seq2, seq4), as the sum of their pairwise similarities is the lowest possible in this case (seq1,seq2 = 1; seq1,seq4 = 3; seq2,seq4 = 2; sum = 6). (I have been using the minimal sum of pairwise similarities as the target, but if that is not achievable, I would be happy just establishing a cutoff, such as: no pairwise similarity in the subset should be > 20.)
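For the toy matrix above, the brute-force check can be sketched in Python (the dictionary below simply transcribes the lower triangle of the example):

```python
from itertools import combinations

# pairwise similarities from the toy matrix above (lower triangle)
sim = {
    ("seq1", "seq2"): 1,
    ("seq1", "seq3"): 2,
    ("seq2", "seq3"): 5,
    ("seq1", "seq4"): 3,
    ("seq2", "seq4"): 2,
    ("seq3", "seq4"): 6,
}
names = ["seq1", "seq2", "seq3", "seq4"]

def subset_score(group):
    """Sum of pairwise similarities within a group."""
    return sum(sim[pair] for pair in combinations(sorted(group), 2))

best_pair = min(combinations(names, 2), key=subset_score)
best_triple = min(combinations(names, 3), key=subset_score)
print(best_pair, subset_score(best_pair))      # ('seq1', 'seq2') 1
print(best_triple, subset_score(best_triple))  # ('seq1', 'seq2', 'seq4') 6
```

This reproduces the n=2 and n=3 results worked out by hand above.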

Upvotes: 0

Views: 413

Answers (2)

benrussell80

Reputation: 347

Here's an implementation in Python. Note that 61 choose 8 is around 3 billion, so checking every possible combination, as I've done here, is going to take a while.

from itertools import combinations

# dataframe stored as df
# assuming rows and columns have the same names
lowest = None
subset = None

for s in combinations(df.columns, 8):
    cols = list(s)  # .loc needs a list of labels, not a tuple
    total = df.loc[cols, cols].fillna(0).values.sum()

    if lowest is None or total < lowest:
        lowest = total
        subset = cols

print(subset, lowest)
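If the label-based lookups make the scan too slow, one option is to pull the values into a NumPy array once and index with `np.ix_` inside the loop. A sketch with a small stand-in matrix (the 6×6 size, seed, and subset size 3 here are illustrative, not from the question):

```python
from itertools import combinations

import numpy as np
import pandas as pd

# small self-contained stand-in for the real 61x61 dataframe
names = [f"seq{i}" for i in range(1, 7)]
rng = np.random.default_rng(0)
vals = np.full((6, 6), np.nan)
low = np.tril_indices(6, k=-1)
vals[low] = rng.integers(1, 30, size=len(low[0]))
df = pd.DataFrame(vals, index=names, columns=names)

# pull the values out once; NaNs become 0 so they don't affect sums
arr = np.nan_to_num(df.to_numpy())
idx = np.arange(len(names))

lowest, subset = None, None
for combo in combinations(idx, 3):
    total = arr[np.ix_(combo, combo)].sum()
    if lowest is None or total < lowest:
        lowest, subset = total, combo

print([names[i] for i in subset], lowest)
```

The loop still visits every combination, but each step is a plain array slice rather than a pandas label lookup.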

Upvotes: 1

teofil

Reputation: 2384

Not sure I entirely understand the task, and I might be oversimplifying, but here is an attempt.

# some test data
seqs <- matrix(nrow = 10, ncol=10)
x <- length(seqs[lower.tri(seqs)])
seqs[lower.tri(seqs)] <- sample.int(n = 5, size = x, replace = TRUE)
nms <- paste("seq", 1:10, sep="")
rownames(seqs) <- colnames(seqs) <- nms

# all combinations of 4 sequences
all_4 <- combn(x = nms, 4, simplify = FALSE)
names(all_4) <- paste("mat", 1:length(all_4), sep="_")

# a function to subset the matrix to a smaller one
submat <- function(mat, cols) {
  mat[cols, cols]
}

mats_4 <- lapply(all_4, function(x) submat(seqs, x))
# similarity per smaller matrix 
mats_4_dist <- sapply(mats_4, sum, na.rm=TRUE)
# index of those matrices with similarity < 20
mats_4_lt20_ind <- mats_4_dist < 20
# extract those matrices
mats_4_lt20 <- mats_4[mats_4_lt20_ind]

# alternatively, find the matrices with the minimal sum
mats_4_min <- mats_4[which.min(mats_4_dist)]

EDIT: I tried this approach after posting with a 61×61 matrix and subsets of 8, and it runs into memory issues, because combn materializes all the combinations at once. i.e.

> combn(61, 8)
Error in matrix(r, nrow = len.r, ncol = count) : 
  invalid 'ncol' value (too large or NA)
In addition: Warning message:
In combn(61, 8) : NAs introduced by coercion to integer range
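Since materializing all 61-choose-8 ≈ 2.9 billion subsets is impractical, a greedy heuristic is one way to get a good (though not guaranteed optimal) subset without any enumeration. A sketch in Python; `greedy_min_subset` and its best-pair seeding rule are my own construction, not part of either answer:

```python
from itertools import combinations

import numpy as np

def greedy_min_subset(sim, k):
    """Greedily grow a subset of k indices with low total pairwise similarity.

    sim: square matrix with pairwise similarities in the lower triangle
         (NaN elsewhere). Heuristic only; no optimality guarantee.
    """
    full = np.nan_to_num(sim)
    full = full + full.T  # symmetrize so lookups work in either order
    n = full.shape[0]
    # seed with the pair having the smallest similarity
    i, j = min(combinations(range(n), 2), key=lambda p: full[p])
    chosen = [i, j]
    while len(chosen) < k:
        rest = [c for c in range(n) if c not in chosen]
        # add the sequence contributing the least extra similarity
        nxt = min(rest, key=lambda c: full[c, chosen].sum())
        chosen.append(nxt)
    return sorted(chosen)

# toy matrix from the question; here the greedy pick matches the exact optimum
sim = np.full((4, 4), np.nan)
sim[np.tril_indices(4, k=-1)] = [1, 2, 5, 3, 2, 6]
print(greedy_min_subset(sim, 3))  # [0, 1, 3] -> seq1, seq2, seq4
```

Cost is roughly O(n^2) per added sequence instead of exponential, so it scales easily to 61 sequences; the result can then be checked against the cutoff (e.g. no pairwise similarity > 20) mentioned in the question.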

Upvotes: 1
