Reputation: 37
I have a set of DNA sequences (strings) that I have compared against each other in a pairwise manner. Each comparison gives the exact number of similarities (how many nucleotides are the same) between two sequences, and the results fill a lower triangular matrix. Now I want to find the subsets of 8 sequences (out of all possible groups of 8) that have the smallest number of similarities among themselves (the pairwise similarities within each group of 8 should be as low as possible), but I do not know how to proceed.
Any help using R (preferred) or Python would be greatly appreciated!
Below is an example of my matrix. The idea is to find the subsets of n sequences (for example, n = 2) that share the smallest number of similarities among themselves. My original matrix is 61x61.
     seq1 seq2 seq3 seq4
seq1   NA   NA   NA   NA
seq2    1   NA   NA   NA
seq3    2    5   NA   NA
seq4    3    2    6   NA
In this example, the subset of n=2 with the smallest similarities is (seq1, seq2), with similarity = 1. The subset for n=3 would be (seq1, seq2, seq4), as the sum of their pairwise similarities is the lowest possible in this case (seq1/seq2 = 1, seq1/seq4 = 3, seq2/seq4 = 2; sum = 6). (I have been using the minimal sum of pairwise similarities as the target, but if that is not feasible, I would be happy with a cutoff instead, such as: no pairwise similarity within the subset should be > 20.)
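For completeness, here is the example matrix as reproducible R code, together with the score I am trying to minimise (just a sketch of the objective, assuming the sequence names are stored as row and column names):
# example matrix from above, lower triangle only
m <- matrix(NA, 4, 4, dimnames = list(paste0("seq", 1:4), paste0("seq", 1:4)))
m[lower.tri(m)] <- c(1, 2, 3, 5, 2, 6)  # filled column by column
# sum of pairwise similarities for one candidate subset, e.g. (seq1, seq2, seq4)
s <- c("seq1", "seq2", "seq4")
sum(m[s, s], na.rm = TRUE)  # 6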
Upvotes: 0
Views: 413
Reputation: 347
Here's an implementation in Python. Note that 61 choose 8 is around 3 billion, so checking every possible combination, as done here, is going to take a while.
from itertools import combinations

# dataframe stored as df
# assuming columns and indices have the same names
subsets_of_columns = combinations(df.columns, 8)

lowest = None
subset = None
for s in subsets_of_columns:
    s = list(s)  # .loc expects a list, not a tuple
    total = df.loc[s, s].fillna(0).values.sum()
    if lowest is None or total < lowest:
        lowest = total
        subset = s

print(subset, lowest)
Upvotes: 1
Reputation: 2384
Not sure I entirely understand the task, and I might be oversimplifying, but here is an attempt.
# some test data
seqs <- matrix(nrow = 10, ncol = 10)
x <- length(seqs[lower.tri(seqs)])
seqs[lower.tri(seqs)] <- sample.int(n = 5, size = x, replace = TRUE)
nms <- paste("seq", 1:10, sep="")
rownames(seqs) <- colnames(seqs) <- nms
# all combinations of 4 sequences
all_4 <- combn(x = nms, 4, simplify = FALSE)
names(all_4) <- paste("mat", 1:length(all_4), sep="_")
# a function to subset the matrix to a smaller one
submat <- function(mat, cols) {
  mat[cols, cols]
}
mats_4 <- lapply(all_4, function(x) submat(seqs, x))
# similarity per smaller matrix
mats_4_dist <- sapply(mats_4, sum, na.rm=TRUE)
# index of those matrices with similarity < 20
mats_4_lt20_ind <- mats_4_dist < 20
# extract those matrices
mats_4_lt20 <- mats_4[mats_4_lt20_ind]
# alternatively, find the matrices with the minimal sum
mats_4_min <- mats_4[which.min(mats_4_dist)]
EDIT: I did not test this approach with a 61x61 matrix and 8x8 submatrices before posting. Trying it afterwards, it definitely runs into memory issues, e.g.:
> combn(61, 8)
Error in matrix(r, nrow = len.r, ncol = count) :
invalid 'ncol' value (too large or NA)
In addition: Warning message:
In combn(61, 8) : NAs introduced by coercion to integer range
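One way around this (a sketch only, not tested at full scale) is to generate the combinations one at a time instead of materialising all of them, keeping just the running minimum. Assuming seqs is the full 61x61 matrix and nms its row names, something like:
n <- 61
k <- 8
idx <- 1:k                       # first combination, in lexicographic order
best <- Inf
best_idx <- NULL
repeat {
  s <- sum(seqs[idx, idx], na.rm = TRUE)
  if (s < best) {
    best <- s
    best_idx <- idx
  }
  # advance idx to the next combination
  i <- k
  while (i >= 1 && idx[i] == n - k + i) i <- i - 1
  if (i < 1) break               # all combinations visited
  idx[i] <- idx[i] + 1
  if (i < k) idx[(i + 1):k] <- idx[i] + seq_len(k - i)
}
nms[best_idx]
best
This keeps memory constant, but looping over choose(61, 8), about 3 billion subsets, in pure R will still take a very long time, so the cutoff idea from the question (discard a subset as soon as one pairwise value exceeds 20) or a heuristic search is probably more practical at that scale.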
Upvotes: 1