roblanf
roblanf

Reputation: 1803

Calculating a similarity/difference matrix from equal length strings in Python

I have pairs of equal-length strings in Python, and an accepted alphabet. Not all of the letters in the strings will come from the accepted alphabet. E.g.

str1 = 'ACGT-N?A'
str2 = 'AAGAA??T'
alphabet = 'ACGT'

What I want to get is a numpy matrix that describes the similarities and differences between the strings. I.e. columns of the matrix are letters from the accepted alphabet in str1 and rows are letters from the accepted alphabet in str2. The entries are the sum of the number of times that str1 and str2 contain the relevant letters. I only care about cases where both strings have accepted letters at a given position.

So, for the example above, my output would be (cols are str1, rows are str2, names are in alphabetical order starting top left):

# cols & rows both refer to 'A', 'C', 'G', 'T' starting top left
# cols are str1, rows are str2

array([[ 1,  1,  0,  1],
       [ 0,  0,  0,  0],
       [ 0,  0,  1,  0],
       [ 1,  0,  0,  0]])

I can brute force this by going through every possible pair of solutions, but I'd like to know if anyone has hints (or links) towards a more general solution. E.g. is there are a way to get closer to something where I can define an alphabet of N unique characters, and get out an N-by-N matrix given two equal-length input strings?

Brute force approach:

def matrix(s1,s2):
    m= np.zeros((4,4))
    for i in range(len(s1)):
        if s1[i]==s2[i]:
            if s1[i]=="A":
                m[0,0]=m[0,0]+1
            elif s1[i]=="C":
                m[1,1]=m[1,1]+1
            elif s1[i]=="G":
                m[2,2]=m[2,2]+1
            elif s1[i]=="T":
                m[3,3]=m[3,3]+1
        elif s1[i]=="A":
            if s2[i]=="C":
                m[1,0]=m[1,0]+1
            elif s2[i]=="G":
                m[2,0]=m[2,0]+1
            elif s2[i]=="T":
                m[3,0]=m[3,0]+1
        elif s1[i]=="C":
            if s2[i]=="A":
                m[0,1]=m[0,1]+1
            elif s2[i]=="G":
                m[2,1]=m[2,1]+1
            elif s2[i]=="T":
                m[3,1]=m[3,1]+1
        elif s1[i]=="G":
            if s2[i]=="A":
                m[0,2]=m[0,2]+1
            elif s2[i]=="C":
                m[1,2]=m[1,2]+1
            elif s2[i]=="T":
                m[3,2]=m[3,2]+1
        elif s1[i]=="T":
            if s2[i]=="C":
                m[1,3]=m[1,3]+1
            elif s2[i]=="G":
                m[2,3]=m[2,3]+1
            elif s2[i]=="A":
                m[0,3]=m[0,3]+1           
    return m

Upvotes: 0

Views: 525

Answers (2)

Daniel F
Daniel F

Reputation: 14399

Using dot product of boolean matrices (easiest way to keep the order right):

def simMtx(a, x, y):
    a = np.array(list(a))
    x = np.array(list(x))
    y = np.array(list(y))
    ax = (x[:, None] == a[None, :]).astype(int)
    ay = (y[:, None] == a[None, :]).astype(int)
    return np.dot(ay.T, ax)

simMtx(alphabet, str1, str2)
Out[183]: 
array([[1, 1, 0, 1],
       [0, 0, 0, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0]])

Upvotes: 2

Stephen Rauch
Stephen Rauch

Reputation: 49804

This task can be done succinctly using set, a few comprehensions and a pandas.DataFrame as:

Code:

from collections import Counter
import pandas as pd

def dot_product(allowed, s1, s2):
    in_s1 = {c: set([y.start() for y in [
        x for x in re.finditer(c, s1)]]) for c in allowed}
    in_s2 = {c: set([y.start() for y in [
        x for x in re.finditer(c, s2)]]) for c in allowed}
    return pd.DataFrame(
        [[len(in_s1[c1] & in_s2[c2]) for c1 in allowed] for c2 in allowed],
        columns=list(allowed),
        index=list(allowed),
    )

Test Code:

str1 = 'ACGT-N?A'
str2 = 'AAGAA??T'
alphabet = 'ACGT'

print(dot_product_sum(alphabet, str1, str2))

Results:

   A  C  G  T
A  1  1  0  1
C  0  0  0  0
G  0  0  1  0
T  1  0  0  0

Upvotes: 1

Related Questions