Reputation: 1803
I have pairs of equal-length strings in Python, and an accepted alphabet. Not all of the letters in the strings will come from the accepted alphabet. E.g.
str1 = 'ACGT-N?A'
str2 = 'AAGAA??T'
alphabet = 'ACGT'
What I want to get is a numpy
matrix that describes the similarities and differences between the strings. I.e. columns of the matrix are letters from the accepted alphabet in str1
and rows are letters from the accepted alphabet in str2
. The entries are the sum of the number of times that str1
and str2
contain the relevant letters. I only care about cases where both strings have accepted letters at a given position.
So, for the example above, my output would be (cols are str1
, rows are str2
, names are in alphabetical order starting top left):
# cols & rows both refer to 'A', 'C', 'G', 'T' starting top left
# cols are str1, rows are str2
array([[ 1, 1, 0, 1],
[ 0, 0, 0, 0],
[ 0, 0, 1, 0],
[ 1, 0, 0, 0]])
I can brute force this by going through every possible pair of solutions, but I'd like to know if anyone has hints (or links) towards a more general solution. E.g. is there are a way to get closer to something where I can define an alphabet of N unique characters, and get out an N-by-N matrix given two equal-length input strings?
Brute force approach:
def matrix(s1,s2):
m= np.zeros((4,4))
for i in range(len(s1)):
if s1[i]==s2[i]:
if s1[i]=="A":
m[0,0]=m[0,0]+1
elif s1[i]=="C":
m[1,1]=m[1,1]+1
elif s1[i]=="G":
m[2,2]=m[2,2]+1
elif s1[i]=="T":
m[3,3]=m[3,3]+1
elif s1[i]=="A":
if s2[i]=="C":
m[1,0]=m[1,0]+1
elif s2[i]=="G":
m[2,0]=m[2,0]+1
elif s2[i]=="T":
m[3,0]=m[3,0]+1
elif s1[i]=="C":
if s2[i]=="A":
m[0,1]=m[0,1]+1
elif s2[i]=="G":
m[2,1]=m[2,1]+1
elif s2[i]=="T":
m[3,1]=m[3,1]+1
elif s1[i]=="G":
if s2[i]=="A":
m[0,2]=m[0,2]+1
elif s2[i]=="C":
m[1,2]=m[1,2]+1
elif s2[i]=="T":
m[3,2]=m[3,2]+1
elif s1[i]=="T":
if s2[i]=="C":
m[1,3]=m[1,3]+1
elif s2[i]=="G":
m[2,3]=m[2,3]+1
elif s2[i]=="A":
m[0,3]=m[0,3]+1
return m
Upvotes: 0
Views: 525
Reputation: 14399
Using dot product of boolean matrices (easiest way to keep the order right):
def simMtx(a, x, y):
a = np.array(list(a))
x = np.array(list(x))
y = np.array(list(y))
ax = (x[:, None] == a[None, :]).astype(int)
ay = (y[:, None] == a[None, :]).astype(int)
return np.dot(ay.T, ax)
simMtx(alphabet, str1, str2)
Out[183]:
array([[1, 1, 0, 1],
[0, 0, 0, 0],
[0, 0, 1, 0],
[1, 0, 0, 0]])
Upvotes: 2
Reputation: 49804
This task can be done succinctly using set
, a few comprehensions and a pandas.DataFrame
as:
Code:
from collections import Counter
import pandas as pd
def dot_product(allowed, s1, s2):
in_s1 = {c: set([y.start() for y in [
x for x in re.finditer(c, s1)]]) for c in allowed}
in_s2 = {c: set([y.start() for y in [
x for x in re.finditer(c, s2)]]) for c in allowed}
return pd.DataFrame(
[[len(in_s1[c1] & in_s2[c2]) for c1 in allowed] for c2 in allowed],
columns=list(allowed),
index=list(allowed),
)
Test Code:
str1 = 'ACGT-N?A'
str2 = 'AAGAA??T'
alphabet = 'ACGT'
print(dot_product_sum(alphabet, str1, str2))
Results:
A C G T
A 1 1 0 1
C 0 0 0 0
G 0 0 1 0
T 1 0 0 0
Upvotes: 1