Calculating a similarity/difference matrix from equal length strings in Python

Question

I have pairs of equal-length strings in Python, and an accepted alphabet. Not all of the letters in the strings will come from the accepted alphabet. E.g.

str1 = 'ACGT-N?A'
str2 = 'AAGAA??T'
alphabet = 'ACGT'

What I want to get is a numpy matrix that describes the similarities and differences between the strings. I.e. columns of the matrix are letters from the accepted alphabet in str1 and rows are letters from the accepted alphabet in str2. The entries are the sum of the number of times that str1 and str2 contain the relevant letters. I only care about cases where both strings have accepted letters at a given position.

So, for the example above, my output would be (cols are str1, rows are str2, names are in alphabetical order starting top left):

# cols & rows both refer to 'A', 'C', 'G', 'T' starting top left
# cols are str1, rows are str2

array([[ 1,  1,  0,  1],
       [ 0,  0,  0,  0],
       [ 0,  0,  1,  0],
       [ 1,  0,  0,  0]])

I can brute force this by going through every possible pair of solutions, but I'd like to know if anyone has hints (or links) towards a more general solution. E.g. is there are a way to get closer to something where I can define an alphabet of N unique characters, and get out an N-by-N matrix given two equal-length input strings?

Brute force approach:

def matrix(s1,s2):
    m= np.zeros((4,4))
    for i in range(len(s1)):
        if s1[i]==s2[i]:
            if s1[i]=="A":
                m[0,0]=m[0,0]+1
            elif s1[i]=="C":
                m[1,1]=m[1,1]+1
            elif s1[i]=="G":
                m[2,2]=m[2,2]+1
            elif s1[i]=="T":
                m[3,3]=m[3,3]+1
        elif s1[i]=="A":
            if s2[i]=="C":
                m[1,0]=m[1,0]+1
            elif s2[i]=="G":
                m[2,0]=m[2,0]+1
            elif s2[i]=="T":
                m[3,0]=m[3,0]+1
        elif s1[i]=="C":
            if s2[i]=="A":
                m[0,1]=m[0,1]+1
            elif s2[i]=="G":
                m[2,1]=m[2,1]+1
            elif s2[i]=="T":
                m[3,1]=m[3,1]+1
        elif s1[i]=="G":
            if s2[i]=="A":
                m[0,2]=m[0,2]+1
            elif s2[i]=="C":
                m[1,2]=m[1,2]+1
            elif s2[i]=="T":
                m[3,2]=m[3,2]+1
        elif s1[i]=="T":
            if s2[i]=="C":
                m[1,3]=m[1,3]+1
            elif s2[i]=="G":
                m[2,3]=m[2,3]+1
            elif s2[i]=="A":
                m[0,3]=m[0,3]+1           
    return m

Daniel F · Accepted Answer

Using dot product of boolean matrices (easiest way to keep the order right):

def simMtx(a, x, y):
    a = np.array(list(a))
    x = np.array(list(x))
    y = np.array(list(y))
    ax = (x[:, None] == a[None, :]).astype(int)
    ay = (y[:, None] == a[None, :]).astype(int)
    return np.dot(ay.T, ax)

simMtx(alphabet, str1, str2)
Out[183]: 
array([[1, 1, 0, 1],
       [0, 0, 0, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0]])

Calculating a similarity/difference matrix from equal length strings in Python

Answers (2)

Related Questions